[v8,01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()

Message ID: 20240925232425.2763385-2-ankur.a.arora@oracle.com
State: New
Series: Enable haltpoll on arm64

Commit Message

Ankur Arora Sept. 25, 2024, 11:24 p.m. UTC
From: Mihai Carabas <mihai.carabas@oracle.com>

The inner loop in poll_idle() polls up to POLL_IDLE_RELAX_COUNT times,
checking to see if the thread has the TIF_NEED_RESCHED bit set. The
loop exits once the condition is met, or if the poll time limit has
been exceeded.

To minimize the number of instructions executed each iteration, the
time check is done only infrequently (once every POLL_IDLE_RELAX_COUNT
iterations). In addition, each loop iteration executes cpu_relax()
which on certain platforms provides a hint to the pipeline that the
loop is busy-waiting, thus allowing the processor to reduce power
consumption.

However, cpu_relax() is defined optimally only on x86. On arm64, for
instance, it is implemented as a YIELD, which only serves as a hint to
the CPU that it should prioritize a different hardware thread if one is
available. arm64 does, however, expose a more optimal polling mechanism
via smp_cond_load_relaxed(), which uses LDXR, WFE to wait for a store
to a specified region.

So restructure the loop, folding both checks into
smp_cond_load_relaxed(). Also, move the time check to the head of the
loop, allowing it to exit straight away once TIF_NEED_RESCHED is set.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
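
For reference, the generic fallback of smp_cond_load_relaxed() simply spins
with cpu_relax() until the condition holds, while arm64 overrides it to
wait with LDXR/WFE. A condensed sketch of the generic form (adapted from
include/asm-generic/barrier.h; simplified, details vary across kernel
versions):

#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
	typeof(ptr) __PTR = (ptr);			\
	__unqual_scalar_typeof(*ptr) VAL;		\
	for (;;) {					\
		VAL = READ_ONCE(*__PTR);		\
		if (cond_expr)				\
			break;				\
		cpu_relax();				\
	}						\
	(typeof(*ptr))VAL;				\
})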

Comments

Catalin Marinas Oct. 15, 2024, 12:04 p.m. UTC | #1
On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..fc1204426158 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  
>  	raw_local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		unsigned int loop_count = 0;
>  		u64 limit;
>  
>  		limit = cpuidle_poll_time(drv, dev);
>  
>  		while (!need_resched()) {
> -			cpu_relax();
> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -				continue;
> -
> -			loop_count = 0;
> +			unsigned int loop_count = 0;
>  			if (local_clock_noinstr() - time_start > limit) {
>  				dev->poll_time_limit = true;
>  				break;
>  			}
> +
> +			smp_cond_load_relaxed(&current_thread_info()->flags,
> +					      VAL & _TIF_NEED_RESCHED ||
> +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);

The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
never set. With the event stream enabled on arm64, the WFE will
eventually be woken up, loop_count incremented and the condition would
become true. However, the smp_cond_load_relaxed() semantics require that
a different agent updates the variable being waited on, not the waiting
CPU updating it itself. Also note that the event stream can be disabled
on arm64 on the kernel command line.

Does the code above break any other architecture? I'd say if you want
something like this, better introduce a new smp_cond_load_timeout()
API. The above looks like a hack that may only work on arm64 when the
event stream is enabled.

A generic option is udelay() (on arm64 it would use WFE/WFET by
default). Not sure how important it is for poll_idle() but the downside
of udelay() is that it won't be able to also poll need_resched() while
waiting for the timeout. If this matters, you could instead make smaller
udelay() calls. Yet another problem, I don't know how energy efficient
udelay() is on x86 vs cpu_relax().

So maybe an smp_cond_load_timeout() would be better, implemented with
cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
the event stream (or fall back to cpu_relax() if the event stream is
disabled).
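
A minimal sketch of what a generic smp_cond_load_timeout() fallback could
look like (hypothetical: the exact name, the time_expr parameter and the
clock-check interval here are assumptions, not a merged API):

#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr)	\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	unsigned int __n = 0;						\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		if (++__n < 200)	/* rate-limit clock reads */	\
			continue;					\
		__n = 0;						\
		if (time_expr)		/* deadline expired */		\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
})

An arch override (arm64, say) could instead wait with LDXR/WFE and let the
event stream bound each wait, falling back to a spin loop otherwise.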
Christoph Lameter (Ampere) Oct. 15, 2024, 4:42 p.m. UTC | #2
On Tue, 15 Oct 2024, Catalin Marinas wrote:

> > +			unsigned int loop_count = 0;
> >  			if (local_clock_noinstr() - time_start > limit) {
> >  				dev->poll_time_limit = true;
> >  				break;
> >  			}
> > +
> > +			smp_cond_load_relaxed(&current_thread_info()->flags,
> > +					      VAL & _TIF_NEED_RESCHED ||
> > +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true. However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself. Also note that the event stream can be disabled
> on arm64 on the kernel command line.


Setting of need_resched() from another processor involves sending an IPI
after that was set. I don't think we need smp_cond_load_relaxed() since
the IPI will cause an event. For ARM a WFE would be sufficient.
Christoph Lameter (Ampere) Oct. 15, 2024, 5:17 p.m. UTC | #3
On Tue, 15 Oct 2024, Catalin Marinas wrote:

> > Setting of need_resched() from another processor involves sending an IPI
> > after that was set. I don't think we need smp_cond_load_relaxed() since
> > the IPI will cause an event. For ARM a WFE would be sufficient.
>
> I'm not worried about the need_resched() case, even without an IPI it
> would still work.
>
> The loop_count++ side of the condition is supposed to timeout in the
> absence of a need_resched() event. You can't do an smp_cond_load_*() on
> a variable that's only updated by the waiting CPU. Nothing guarantees to
> wake it up to update the variable (the event stream on arm64, yes, but
> that's generic code).

Hmm... I have a WFET implementation here without smp_cond, modelled
after the delay() implementation on ARM64 (but it's not generic and an
additional patch is required to make this work; intermediate patch
attached)


From: Christoph Lameter (Ampere) <cl@gentwo.org>
Subject: [Haltpoll] Implement waiting using WFET

Use WFET if the hardware supports it to implement
a wait until something happens to wake up the cpu.

If WFET is not available then use the event stream
source to wake up periodically until an event happens
or the timeout expires.

smp_cond_load() is not necessary because the scheduler
will create an event on the targeted cpu by sending an IPI.

Without it we can simply take the basic approach
from the delay() function and customize it a bit.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 drivers/cpuidle/poll_state.c | 43 +++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux/drivers/cpuidle/poll_state.c
===================================================================
--- linux.orig/drivers/cpuidle/poll_state.c
+++ linux/drivers/cpuidle/poll_state.c
@@ -5,48 +5,41 @@

 #include <linux/cpuidle.h>
 #include <linux/sched.h>
-#include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
-
-#ifdef CONFIG_ARM64
-/*
- * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
- * while polling for TIF_NEED_RESCHED in thread_info->flags.
- *
- * Set this to a low value since arm64, instead of polling, uses a
- * event based mechanism.
- */
-#define POLL_IDLE_RELAX_COUNT	1
-#else
-#define POLL_IDLE_RELAX_COUNT	200
-#endif
+#include <clocksource/arm_arch_timer.h>

 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	const cycles_t start = get_cycles();

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		u64 limit;

-		limit = cpuidle_poll_time(drv, dev);
+		const cycles_t end = start + ARCH_TIMER_NSECS_TO_CYCLES(cpuidle_poll_time(drv, dev));

 		while (!need_resched()) {
-			unsigned int loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}

-			smp_cond_load_relaxed(&current_thread_info()->flags,
-					      VAL & _TIF_NEED_RESCHED ||
-					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
+			if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
+
+				/* We can power down for a configurable interval while waiting */
+				while (!need_resched() && get_cycles() < end)
+						wfet(end);
+
+			} else if (arch_timer_evtstrm_available()) {
+				const cycles_t timer_period = ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
+
+				/* Wake up periodically during evstream events */
+				while (!need_resched() && get_cycles() + timer_period <= end)
+						wfe();
+			}
 		}
+
+		/* In case time is not up yet due to coarse time intervals above */
+		while (!need_resched() && get_cycles() < end)
+					cpu_relax();
 	}
 	raw_local_irq_disable();
From: Christoph Lameter (Ampere) <cl@linux.com>

Move the conversion from time to cycles of arch_timer
into arch_timer.h. Add nsec conversion since we will need that soon.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/arch/arm64/lib/delay.c
===================================================================
--- linux.orig/arch/arm64/lib/delay.c
+++ linux/arch/arm64/lib/delay.c
@@ -15,14 +15,6 @@
 
 #include <clocksource/arm_arch_timer.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
-{
-	return (xloops * loops_per_jiffy * HZ) >> 32;
-}
-
 void __delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
@@ -39,7 +31,7 @@ void __delay(unsigned long cycles)
 			wfet(end);
 	} else 	if (arch_timer_evtstrm_available()) {
 		const cycles_t timer_evt_period =
-			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
+			ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
 
 		while ((get_cycles() - start + timer_evt_period) < cycles)
 			wfe();
@@ -52,7 +44,7 @@ EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)
 {
-	__delay(xloops_to_cycles(xloops));
+	__delay(arch_timer_xloops_to_cycles(xloops));
 }
 EXPORT_SYMBOL(__const_udelay);
 
Index: linux/include/clocksource/arm_arch_timer.h
===================================================================
--- linux.orig/include/clocksource/arm_arch_timer.h
+++ linux/include/clocksource/arm_arch_timer.h
 extern u64 (*arch_timer_read_counter)(void);
 extern struct arch_timer_kvm_info *arch_timer_get_kvm_info(void);
 extern bool arch_timer_evtstrm_available(void);
 
+#include <linux/delay.h>
+
+static inline unsigned long arch_timer_xloops_to_cycles(unsigned long xloops)
+{
+	return (xloops * loops_per_jiffy * HZ) >> 32;
+}
+
+#define ARCH_TIMER_USECS_TO_CYCLES(time_usecs)			\
+	arch_timer_xloops_to_cycles((time_usecs) * 0x10C7UL)
+
+#define ARCH_TIMER_NSECS_TO_CYCLES(time_nsecs)			\
+	arch_timer_xloops_to_cycles((time_nsecs) * 0x5UL)
+
 #else
 
 static inline u32 arch_timer_get_rate(void)
Catalin Marinas Oct. 15, 2024, 5:40 p.m. UTC | #4
On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote:
> On Tue, 15 Oct 2024, Catalin Marinas wrote:
> > > Setting of need_resched() from another processor involves sending an IPI
> > > after that was set. I don't think we need smp_cond_load_relaxed() since
> > > the IPI will cause an event. For ARM a WFE would be sufficient.
> >
> > I'm not worried about the need_resched() case, even without an IPI it
> > would still work.
> >
> > The loop_count++ side of the condition is supposed to timeout in the
> > absence of a need_resched() event. You can't do an smp_cond_load_*() on
> > a variable that's only updated by the waiting CPU. Nothing guarantees to
> > wake it up to update the variable (the event stream on arm64, yes, but
> > that's generic code).
> 
> Hmm... I have a WFET implementation here without smp_cond, modelled
> after the delay() implementation on ARM64 (but it's not generic and an
> additional patch is required to make this work; intermediate patch
> attached)

At least one additional patch ;). But yeah, I suggested hiding all this
behind something like smp_cond_load_timeout() which would wait on
current_thread_info()->flags but with a timeout. The arm64
implementation would follow some of the logic in __delay(). Others may
simply poll with cpu_relax().

Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
rely on need_resched() and some new delay/cpu_relax() API that waits for
a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout(),
which on arm64 would just be a simplified version of __delay() without
the 'while' loops.
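
A rough sketch of that alternative, for illustration (the name
cpu_relax_timeout() and its cycle-deadline parameter are hypothetical,
modelled on the __delay() logic referenced above):

/* One bounded wait: returns on any event (e.g. an IPI) or the deadline. */
static inline void cpu_relax_timeout(cycles_t end)
{
	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))
		wfet(end);		/* precise timed wait */
	else if (arch_timer_evtstrm_available())
		wfe();			/* bounded by the event stream period */
	else
		cpu_relax();		/* plain spin hint */
}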
Ankur Arora Oct. 15, 2024, 9:32 p.m. UTC | #5
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..fc1204426158 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>>
>>  	raw_local_irq_enable();
>>  	if (!current_set_polling_and_test()) {
>> -		unsigned int loop_count = 0;
>>  		u64 limit;
>>
>>  		limit = cpuidle_poll_time(drv, dev);
>>
>>  		while (!need_resched()) {
>> -			cpu_relax();
>> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> -				continue;
>> -
>> -			loop_count = 0;
>> +			unsigned int loop_count = 0;
>>  			if (local_clock_noinstr() - time_start > limit) {
>>  				dev->poll_time_limit = true;
>>  				break;
>>  			}
>> +
>> +			smp_cond_load_relaxed(&current_thread_info()->flags,
>> +					      VAL & _TIF_NEED_RESCHED ||
>> +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true.

That makes sense.

> However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself.

Right. That seems to work well with the semantics of WFE. And the
event stream (if enabled) has a side effect that allows the exit
from the loop.

> Also note that the event stream can be disabled
> on arm64 on the kernel command line.

Yes, that's a good point. In patch 11 I tried to address that aspect
by only allowing haltpoll to be force-loaded.

But I guess your point is that it's not just haltpoll that has a problem,
but also regular polling -- and maybe the right thing to do would be to
disable polling if the event stream is disabled.

> Does the code above break any other architecture?

I (and others) have so far tested x86, ARM64 (with/without the
event stream), and I believe riscv. I haven't seen any obvious
breakage. But that's probably because most of the time somebody
would have set TIF_NEED_RESCHED.

> I'd say if you want
> something like this, better introduce a new smp_cond_load_timeout()
> API. The above looks like a hack that may only work on arm64 when the
> event stream is enabled.

I had a preliminary version of smp_cond_load_relaxed_timeout() here:
 https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/

Even with an smp_cond_load_timeout(), we would need to fall back to
something like the above for uarchs without WFxT.

> A generic option is udelay() (on arm64 it would use WFE/WFET by
> default). Not sure how important it is for poll_idle() but the downside
> of udelay() is that it won't be able to also poll need_resched() while
> waiting for the timeout. If this matters, you could instead make smaller
> udelay() calls. Yet another problem, I don't know how energy efficient
> udelay() is on x86 vs cpu_relax().
>
> So maybe an smp_cond_load_timeout() would be better, implemented with
> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> the event stream (or fall back to cpu_relax() if the event stream is
> disabled).

Yeah, something like that might work.

--
ankur
Ankur Arora Oct. 15, 2024, 9:53 p.m. UTC | #6
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote:
>> On Tue, 15 Oct 2024, Catalin Marinas wrote:
>> > > Setting of need_resched() from another processor involves sending an IPI
>> > > after that was set. I don't think we need smp_cond_load_relaxed() since
>> > > the IPI will cause an event. For ARM a WFE would be sufficient.
>> >
>> > I'm not worried about the need_resched() case, even without an IPI it
>> > would still work.
>> >
>> > The loop_count++ side of the condition is supposed to timeout in the
>> > absence of a need_resched() event. You can't do an smp_cond_load_*() on
>> > a variable that's only updated by the waiting CPU. Nothing guarantees to
>> > wake it up to update the variable (the event stream on arm64, yes, but
>> > that's generic code).
>>
>> Hmm... I have a WFET implementation here without smp_cond, modelled
>> after the delay() implementation on ARM64 (but it's not generic and an
>> additional patch is required to make this work; intermediate patch
>> attached)
>
> At least one additional patch ;). But yeah, I suggested hiding all this
> behind something like smp_cond_load_timeout() which would wait on
> current_thread_info()->flags but with a timeout. The arm64
> implementation would follow some of the logic in __delay(). Others may
> simply poll with cpu_relax().
>
> Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
> rely on need_resched() and some new delay/cpu_relax() API that waits for
> a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout(),
> which on arm64 would just be a simplified version of __delay() without
> the 'while' loops.

AFAICT when polling (which we are since poll_idle() calls
current_set_polling_and_test()), the scheduler will elide the IPI
by remotely setting the need-resched bit via set_nr_if_polling().

Once we stop polling then the scheduler should take the IPI path
because call_function_single_prep_ipi() will fail.

--
ankur
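
For context, a condensed sketch of set_nr_if_polling() (adapted from
kernel/sched/core.c; details vary across kernel versions), which sets the
remote need-resched bit without an IPI as long as the target advertises
TIF_POLLING_NRFLAG:

static bool set_nr_if_polling(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) val = READ_ONCE(ti->flags);

	do {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;	/* not polling: caller sends an IPI */
		if (val & _TIF_NEED_RESCHED)
			return true;	/* already set */
	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

	return true;
}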
Christoph Lameter (Ampere) Oct. 15, 2024, 10:28 p.m. UTC | #7
On Tue, 15 Oct 2024, Ankur Arora wrote:

> > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
> > rely on need_resched() and some new delay/cpu_relax() API that waits for
> > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout(),
> > which on arm64 would just be a simplified version of __delay() without
> > the 'while' loops.
>
> AFAICT when polling (which we are since poll_idle() calls
> current_set_polling_and_test()), the scheduler will elide the IPI
> by remotely setting the need-resched bit via set_nr_if_polling().

The scheduler runs on multiple cores. The core on which we are
running this code puts itself into a wait state, so the scheduler does
not run on it at all during the wait period.

The other cores may run scheduler functions and set the need_resched bit
for the core where we are currently waiting.

The other core will wake our core up by sending an IPI. The IPI will
invoke a scheduler function on our core and the WFE will continue.

> Once we stop polling then the scheduler should take the IPI path
> because call_function_single_prep_ipi() will fail.

The IPI stops the polling. An IPI is an interrupt.
Christoph Lameter (Ampere) Oct. 15, 2024, 10:40 p.m. UTC | #8
Here is a patch that keeps the cpuidle stuff generic but allows an
override by arm64.


From: Christoph Lameter (Ampere) <cl@linux.com>
Subject: Revise cpu poll idle to make full use of wfet() and wfe()

ARM64 has instructions that can wait for an event and timeouts.

Clean up the code in drivers/cpuidle/ to wait until the end
of a period and allow the override of the handling of the
waiting by an architecture.

Provide an optimized wait function for arm64.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/arch/arm64/lib/delay.c
===================================================================
--- linux.orig/arch/arm64/lib/delay.c
+++ linux/arch/arm64/lib/delay.c
@@ -12,6 +12,8 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <linux/sched/clock.h>
+#include <linux/cpuidle.h>

 #include <clocksource/arm_arch_timer.h>

@@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs)
 	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
 }
 EXPORT_SYMBOL(__ndelay);
+
+void cpuidle_wait_for_resched_with_timeout(u64 end)
+{
+	u64 start;
+
+	while (!need_resched() && (start = local_clock_noinstr()) < end) {
+
+		if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
+
+			/* Processor supports waiting for a specified period */
+			wfet(xloops_to_cycles((end - start) * 0x5UL));
+
+		} else
+		if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) {
+
+			/* We can wait until a periodic event occurs */
+			wfe();
+
+		} else
+			/* Need to spin until the end */
+			cpu_relax();
+	}
+}
+
Index: linux/drivers/cpuidle/poll_state.c
===================================================================
--- linux.orig/drivers/cpuidle/poll_state.c
+++ linux/drivers/cpuidle/poll_state.c
@@ -8,35 +8,29 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>

-#define POLL_IDLE_RELAX_COUNT	200
+__weak void cpuidle_wait_for_resched_with_timeout(u64 end)
+{
+	while (!need_resched() && local_clock_noinstr() < end) {
+		cpu_relax();
+	}
+}

 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	u64 time_start = local_clock_noinstr();
+	u64 time_end = time_start + cpuidle_poll_time(drv, dev);

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
-		u64 limit;

-		limit = cpuidle_poll_time(drv, dev);
+		cpuidle_wait_for_resched_with_timeout(time_end);
+
+		if (!need_resched())
+			dev->poll_time_limit = true;

-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
 	}
 	raw_local_irq_disable();

Index: linux/include/linux/cpuidle.h
===================================================================
--- linux.orig/include/linux/cpuidle.h
+++ linux/include/linux/cpuidle.h
@@ -202,6 +202,9 @@ extern int cpuidle_play_dead(void);
 extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
 static inline struct cpuidle_device *cpuidle_get_device(void)
 {return __this_cpu_read(cpuidle_devices); }
+
+extern __weak void cpuidle_wait_for_resched_with_timeout(u64);
+
 #else
 static inline void disable_cpuidle(void) { }
 static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
Ankur Arora Oct. 16, 2024, 7:06 a.m. UTC | #9
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Tue, 15 Oct 2024, Ankur Arora wrote:
>
>> > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
>> > rely on need_resched() and some new delay/cpu_relax() API that waits for
>> > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout(),
>> > which on arm64 would just be a simplified version of __delay() without
>> > the 'while' loops.
>>
>> AFAICT when polling (which we are since poll_idle() calls
>> current_set_polling_and_test()), the scheduler will elide the IPI
>> by remotely setting the need-resched bit via set_nr_if_polling().
>
> The scheduler runs on multiple cores. The core on which we are
> running this code puts itself into a wait state, so the scheduler does
> not run on it at all during the wait period.

Yes.

> The other cores may run scheduler functions and set the need_resched bit
> for the core where we are currently waiting.

Yes.

> The other core will wake our core up by sending an IPI. The IPI will
> invoke a scheduler function on our core and the WFE will continue.

Why? The target core is not sleeping. It is *polling* on a memory
address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
it that a need-resched bit is set.

>> Once we stop polling then the scheduler should take the IPI path
>> because call_function_single_prep_ipi() will fail.
>
> The IPI stops the polling. An IPI is an interrupt.

Yes, an IPI is an interrupt. And since the target is polling, there's
no need for an interrupt to inform it that a memory address on which
it is polling has changed.

resched_curr() is a good example. It only sends the resched IPI if the
target is not polling.

resched_curr() {
         ...
        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
        else
                trace_sched_wake_idle_without_ipi(cpu);
}

--
ankur
Catalin Marinas Oct. 16, 2024, 9:54 a.m. UTC | #10
On Tue, Oct 15, 2024 at 03:40:33PM -0700, Christoph Lameter (Ampere) wrote:
> Index: linux/arch/arm64/lib/delay.c
> ===================================================================
> --- linux.orig/arch/arm64/lib/delay.c
> +++ linux/arch/arm64/lib/delay.c
> @@ -12,6 +12,8 @@
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/timex.h>
> +#include <linux/sched/clock.h>
> +#include <linux/cpuidle.h>
> 
>  #include <clocksource/arm_arch_timer.h>
> 
> @@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs)
>  	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
>  }
>  EXPORT_SYMBOL(__ndelay);
> +
> +void cpuidle_wait_for_resched_with_timeout(u64 end)
> +{
> +	u64 start;
> +
> +	while (!need_resched() && (start = local_clock_noinstr()) < end) {
> +
> +		if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
> +
> +			/* Processor supports waiting for a specified period */
> +			wfet(xloops_to_cycles((end - start) * 0x5UL));
> +
> +		} else
> +		if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) {
> +
> +			/* We can wait until a periodic event occurs */
> +			wfe();
> +
> +		} else
> +			/* Need to spin until the end */
> +			cpu_relax();
> +	}
> +}

The behaviour above is slightly different from the current poll_idle()
implementation. The above is more like poll every timeout period rather
than continuously poll until either the need_resched() condition is true
_or_ the timeout expired. From Ankur's email, an IPI may not happen so
we don't have any guarantee that WFET will wake up before the timeout.
The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
arm the exclusive monitor. That's what smp_cond_load_relaxed() does.

If you only need the behaviour proposed above, you might as well go for
udelay() directly. Otherwise I think we need to revisit Ankur's
smp_cond_load_timeout() proposal from earlier this year.
Catalin Marinas Oct. 16, 2024, 10:06 a.m. UTC | #11
On Tue, Oct 15, 2024 at 02:32:00PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > I'd say if you want
> > something like this, better introduce a new smp_cond_load_timeout()
> > API. The above looks like a hack that may only work on arm64 when the
> > event stream is enabled.
> 
> I had a preliminary version of smp_cond_load_relaxed_timeout() here:
>  https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/
> 
> Even with an smp_cond_load_timeout(), we would need to fall back to
> something like the above for uarchs without WFxT.

Yes, the default/generic implementation would use cpu_relax(). For the
arm64 one, some combination of __cmpwait() and __delay() with the
fallback to cpu_relax().
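
A sketch of what that arm64 override might look like, under the stated
assumptions (hypothetical; neither the structure nor the names below are
from a merged patch):

#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr)	\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		if (time_expr)						\
			break;						\
		if (arch_timer_evtstrm_available())			\
			/* LDXR/WFE: woken by a store to *__PTR or,	\
			 * at the latest, by an evtstrm tick */		\
			__cmpwait_relaxed(__PTR, VAL);			\
		else							\
			cpu_relax();					\
	}								\
	(typeof(*ptr))VAL;						\
})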
Okanovic, Haris Oct. 16, 2024, 3:13 p.m. UTC | #12
On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> 
> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > index 9b6d90a72601..fc1204426158 100644
> > --- a/drivers/cpuidle/poll_state.c
> > +++ b/drivers/cpuidle/poll_state.c
> > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > 
> >       raw_local_irq_enable();
> >       if (!current_set_polling_and_test()) {
> > -             unsigned int loop_count = 0;
> >               u64 limit;
> > 
> >               limit = cpuidle_poll_time(drv, dev);
> > 
> >               while (!need_resched()) {
> > -                     cpu_relax();
> > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > -                             continue;
> > -
> > -                     loop_count = 0;
> > +                     unsigned int loop_count = 0;
> >                       if (local_clock_noinstr() - time_start > limit) {
> >                               dev->poll_time_limit = true;
> >                               break;
> >                       }
> > +
> > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > +                                           VAL & _TIF_NEED_RESCHED ||
> > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> 
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true. However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself. Also note that the event stream can be disabled
> on arm64 on the kernel command line.

Alternately could we condition arch_haltpoll_want() on
arch_timer_evtstrm_available(), like v7?

> 
> Does the code above break any other architecture? I'd say if you want
> something like this, better introduce a new smp_cond_load_timeout()
> API. The above looks like a hack that may only work on arm64 when the
> event stream is enabled.
> 
> A generic option is udelay() (on arm64 it would use WFE/WFET by
> default). Not sure how important it is for poll_idle() but the downside
> of udelay() is that it won't be able to also poll need_resched() while
> waiting for the timeout. If this matters, you could instead make smaller
> udelay() calls. Yet another problem, I don't know how energy efficient
> udelay() is on x86 vs cpu_relax().
> 
> So maybe an smp_cond_load_timeout() would be better, implemented with
> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> the event stream (or fall back to cpu_relax() if the event stream is
> disabled).
> 
> --
> Catalin
Okanovic, Haris Oct. 16, 2024, 6:04 p.m. UTC | #13
On Wed, 2024-10-16 at 10:04 -0700, Ankur Arora wrote:
> 
> Okanovic, Haris <harisokn@amazon.com> writes:
> 
> > On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> > > 
> > > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > > > index 9b6d90a72601..fc1204426158 100644
> > > > --- a/drivers/cpuidle/poll_state.c
> > > > +++ b/drivers/cpuidle/poll_state.c
> > > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > > 
> > > >       raw_local_irq_enable();
> > > >       if (!current_set_polling_and_test()) {
> > > > -             unsigned int loop_count = 0;
> > > >               u64 limit;
> > > > 
> > > >               limit = cpuidle_poll_time(drv, dev);
> > > > 
> > > >               while (!need_resched()) {
> > > > -                     cpu_relax();
> > > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > > > -                             continue;
> > > > -
> > > > -                     loop_count = 0;
> > > > +                     unsigned int loop_count = 0;
> > > >                       if (local_clock_noinstr() - time_start > limit) {
> > > >                               dev->poll_time_limit = true;
> > > >                               break;
> > > >                       }
> > > > +
> > > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > > > +                                           VAL & _TIF_NEED_RESCHED ||
> > > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > > 
> > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> > > never set. With the event stream enabled on arm64, the WFE will
> > > eventually be woken up, loop_count incremented and the condition would
> > > become true. However, the smp_cond_load_relaxed() semantics require that
> > > a different agent updates the variable being waited on, not the waiting
> > > CPU updating it itself. Also note that the event stream can be disabled
> > > on arm64 on the kernel command line.
> > 
> > Alternately could we condition arch_haltpoll_want() on
> > arch_timer_evtstrm_available(), like v7?
> 
> Yes, I'm thinking of staging it somewhat like that. First an
> smp_cond_load_relaxed() which gets rid of this issue, followed by
> one based on smp_cond_load_relaxed_timeout().
> 
> That said, conditioning just arch_haltpoll_want() won't suffice since
> what Catalin pointed out affects all users of poll_idle(), not just
> haltpoll.

The only other users I see today are apm_init() and
acpi_processor_setup_cstates(), both in the x86 path. Perhaps not ideal,
but should be sufficient.

> 
> Right now there's only haltpoll but there are future users like
> zhenglifeng with a patch for acpi-idle here:
> 
>   https://lore.kernel.org/all/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/
> 
> > > Does the code above break any other architecture? I'd say if you want
> > > something like this, better introduce a new smp_cond_load_timeout()
> > > API. The above looks like a hack that may only work on arm64 when the
> > > event stream is enabled.
> > > 
> > > A generic option is udelay() (on arm64 it would use WFE/WFET by
> > > default). Not sure how important it is for poll_idle() but the downside
> > > of udelay() is that it won't be able to also poll need_resched() while
> > > waiting for the timeout. If this matters, you could instead make smaller
> > > udelay() calls. Yet another problem, I don't know how energy efficient
> > > udelay() is on x86 vs cpu_relax().
> > > 
> > > So maybe an smp_cond_load_timeout() would be better, implemented with
> > > cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> > > the event stream (or fall back to cpu_relax() if the event stream is
> > > disabled).
> > > 
> > > --
> > > Catalin
> 
> 
> --
> ankur
Catalin Marinas Oct. 17, 2024, 2:01 p.m. UTC | #14
On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > > index 9b6d90a72601..fc1204426158 100644
> > > --- a/drivers/cpuidle/poll_state.c
> > > +++ b/drivers/cpuidle/poll_state.c
> > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > 
> > >       raw_local_irq_enable();
> > >       if (!current_set_polling_and_test()) {
> > > -             unsigned int loop_count = 0;
> > >               u64 limit;
> > > 
> > >               limit = cpuidle_poll_time(drv, dev);
> > > 
> > >               while (!need_resched()) {
> > > -                     cpu_relax();
> > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > > -                             continue;
> > > -
> > > -                     loop_count = 0;
> > > +                     unsigned int loop_count = 0;
> > >                       if (local_clock_noinstr() - time_start > limit) {
> > >                               dev->poll_time_limit = true;
> > >                               break;
> > >                       }
> > > +
> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > > +                                           VAL & _TIF_NEED_RESCHED ||
> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > 
> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> > never set. With the event stream enabled on arm64, the WFE will
> > eventually be woken up, loop_count incremented and the condition would
> > become true. However, the smp_cond_load_relaxed() semantics require that
> > a different agent updates the variable being waited on, not the waiting
> > CPU updating it itself. Also note that the event stream can be disabled
> > on arm64 on the kernel command line.
> 
> Alternately could we condition arch_haltpoll_want() on
> arch_timer_evtstrm_available(), like v7?

No. The problem is about the smp_cond_load_relaxed() semantics - it
can't wait on a variable that's only updated in its exit condition. We
need a new API for this, especially since we are changing generic code
here (even if it was arm64 code only, I'd still object to such
smp_cond_load_*() constructs).
Christoph Lameter (Ampere) Oct. 17, 2024, 4:54 p.m. UTC | #15
On Wed, 16 Oct 2024, Ankur Arora wrote:

> > The other core will wake our core up by sending an IPI. The IPI will
> > invoke a scheduler function on our core and the WFE will continue.
>
> Why? The target core is not sleeping. It is *polling* on a memory
> address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
> it that a need-resched bit is set.

The IPI is sent to interrupt the process that is not sleeping. This is
done so the busy processor can reschedule the currently running process
and respond to the event.

It does not matter if the core is "sleeping" or not.
Christoph Lameter (Ampere) Oct. 17, 2024, 4:56 p.m. UTC | #16
On Wed, 16 Oct 2024, Catalin Marinas wrote:

> The behaviour above is slightly different from the current poll_idle()
> implementation. The above is more like poll every timeout period rather
> than continuously poll until either the need_resched() condition is true
> _or_ the timeout expired. From Ankur's email, an IPI may not happen so
> we don't have any guarantee that WFET will wake up before the timeout.
> The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
> arm the exclusive monitor. That's what smp_cond_load_relaxed() does.

Sorry no. The IPI will cause the WFE to continue immediately and not wait
till the end of the timeout period.
Catalin Marinas Oct. 17, 2024, 6:15 p.m. UTC | #17
On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote:
> On Wed, 16 Oct 2024, Catalin Marinas wrote:
> > The behaviour above is slightly different from the current poll_idle()
> > implementation. The above is more like poll every timeout period rather
> > than continuously poll until either the need_resched() condition is true
> > _or_ the timeout expired. From Ankur's email, an IPI may not happen so
> > we don't have any guarantee that WFET will wake up before the timeout.
> > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
> > arm the exclusive monitor. That's what smp_cond_load_relaxed() does.
> 
> Sorry no. The IPI will cause the WFE to continue immediately and not wait
> till the end of the timeout period.

*If* there is an IPI. The scheduler is not really my area but some
functions like wake_up_idle_cpu() seem to elide the IPI if
TIF_POLLING_NRFLAG is set.

But even if we had an IPI, it still feels like abusing the semantics of
smp_cond_load_relaxed() when relying on it to increment a variable in
the condition check as a result of some unrelated wake-up event. This
API is meant to wait for a condition on a single variable. It cannot
wait on multiple variables and especially not one it updates itself
(even if it happens to work on arm64 under certain conditions).

My strong preference would be to revive the smp_cond_load_timeout()
proposal from Ankur earlier in the year.
Ankur Arora Oct. 17, 2024, 6:36 p.m. UTC | #18
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Wed, 16 Oct 2024, Ankur Arora wrote:
>
>> > The other core will wake our core up by sending an IPI. The IPI will
>> > invoke a scheduler function on our core and the WFE will continue.
>>
>> Why? The target core is not sleeping. It is *polling* on a memory
>> address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
>> it that a need-resched bit is set.
>
> The IPI is sent to interrupt the process that is not sleeping. This is
> done so the busy processor can reschedule the currently running process
> and respond to the event.

The scheduler treats idle specially (if the architecture defines
TIF_POLLING_NRFLAG). There's also the sched_wake_idle_without_ipi
tracepoint for this path.

$ sudo perf stat -e sched:sched_wake_idle_without_ipi perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

     Total time: 5.173 [sec]

       5.173613 usecs/op
         193288 ops/sec

 Performance counter stats for 'perf bench sched pipe':

         1,992,368      sched:sched_wake_idle_without_ipi

       5.178976487 seconds time elapsed

       0.396076000 seconds user
       6.999566000 seconds sys

--
ankur
Ankur Arora Oct. 17, 2024, 7:34 p.m. UTC | #19
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote:
>> On Wed, 16 Oct 2024, Catalin Marinas wrote:
>> > The behaviour above is slightly different from the current poll_idle()
>> > implementation. The above is more like poll every timeout period rather
>> > than continuously poll until either the need_resched() condition is true
>> > _or_ the timeout expired. From Ankur's email, an IPI may not happen so
>> > we don't have any guarantee that WFET will wake up before the timeout.
>> > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
>> > arm the exclusive monitor. That's what smp_cond_load_relaxed() does.
>>
>> Sorry no. The IPI will cause the WFE to continue immediately and not wait
>> till the end of the timeout period.
>
> *If* there is an IPI. The scheduler is not really my area but some
> functions like wake_up_idle_cpu() seem to elide the IPI if
> TIF_NR_POLLING is set.
>
> But even if we had an IPI, it still feels like abusing the semantics of
> smp_cond_load_relaxed() when relying on it to increment a variable in
> the condition check as a result of some unrelated wake-up event. This
> API is meant to wait for a condition on a single variable. It cannot
> wait on multiple variables and especially not one it updates itself
> (even if it happens to work on arm64 under certain conditions).

Yeah that makes sense. smp_cond_load_relaxed() uses two separate
side-effects to make sure things work: the event-stream and the
increment in the conditional.

I do want to thrash out smp_cond_load_timeout() a bit more but let
me reply to your other mail for that.

> My strong preference would be to revive the smp_cond_load_timeout()
> proposal from Ankur earlier in the year.

Ack that.

--
ankur
Ankur Arora Oct. 17, 2024, 10:47 p.m. UTC | #20
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
>> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
>> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> > > index 9b6d90a72601..fc1204426158 100644
>> > > --- a/drivers/cpuidle/poll_state.c
>> > > +++ b/drivers/cpuidle/poll_state.c
>> > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> > >
>> > >       raw_local_irq_enable();
>> > >       if (!current_set_polling_and_test()) {
>> > > -             unsigned int loop_count = 0;
>> > >               u64 limit;
>> > >
>> > >               limit = cpuidle_poll_time(drv, dev);
>> > >
>> > >               while (!need_resched()) {
>> > > -                     cpu_relax();
>> > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> > > -                             continue;
>> > > -
>> > > -                     loop_count = 0;
>> > > +                     unsigned int loop_count = 0;
>> > >                       if (local_clock_noinstr() - time_start > limit) {
>> > >                               dev->poll_time_limit = true;
>> > >                               break;
>> > >                       }
>> > > +
>> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
>> > > +                                           VAL & _TIF_NEED_RESCHED ||
>> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
>> >
>> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
>> > never set. With the event stream enabled on arm64, the WFE will
>> > eventually be woken up, loop_count incremented and the condition would
>> > become true. However, the smp_cond_load_relaxed() semantics require that
>> > a different agent updates the variable being waited on, not the waiting
>> > CPU updating it itself. Also note that the event stream can be disabled
>> > on arm64 on the kernel command line.
>>
>> Alternately could we condition arch_haltpoll_want() on
>> arch_timer_evtstrm_available(), like v7?
>
> No. The problem is about the smp_cond_load_relaxed() semantics - it
> can't wait on a variable that's only updated in its exit condition. We
> need a new API for this, especially since we are changing generic code
> here (even if it was arm64 code only, I'd still object to such
> smp_cond_load_*() constructs).

Right. The problem is that smp_cond_load_relaxed() used in this context
depends on the event-stream side effect when the interface does not
encode those semantics anywhere.

So, a smp_cond_load_timeout() like in [1] that continues to depend on
the event-stream is better because it explicitly accounts for the side
effect from the timeout.

This would cover both the WFxT and the event-stream case.

The part I'm a little less sure about is the case where WFxT and the
event-stream are absent.

As you said earlier, for that case on arm64, we use either short
__delay() calls or spin in cpu_relax(), both of which are essentially
the same thing.

Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends
it and from my measurement a loop doing "while (!cond) cpu_relax()" gets
an IPC of something like 0.1 or similar.

On my arm64 systems however the same loop gets an IPC of 2.  Now this
likely varies greatly but seems like it would run pretty hot some of
the time.

So maybe the right thing to do would be to keep smp_cond_load_timeout()
but only allow polling if WFxT or event-stream is enabled. And enhance
cpuidle_poll_state_init() to fail if the above condition is not met.

Does that make sense?

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/
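
A sketch of that gating (hypothetical: the arch_haltpoll_want() signature
and the exact capability checks are assumptions):

/* Only advertise haltpoll when the CPU can wait efficiently, i.e.
 * WFET is implemented or the event stream bounds each WFE. */
bool arch_haltpoll_want(bool force)
{
	return alternative_has_cap_unlikely(ARM64_HAS_WFXT) ||
	       arch_timer_evtstrm_available() || force;
}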
Catalin Marinas Oct. 18, 2024, 11:05 a.m. UTC | #21
On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
> >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> >> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> >> > > +                                           VAL & _TIF_NEED_RESCHED ||
> >> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> >> >
> >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> >> > never set. With the event stream enabled on arm64, the WFE will
> >> > eventually be woken up, loop_count incremented and the condition would
> >> > become true. However, the smp_cond_load_relaxed() semantics require that
> >> > a different agent updates the variable being waited on, not the waiting
> >> > CPU updating it itself. Also note that the event stream can be disabled
> >> > on arm64 on the kernel command line.
> >>
> >> Alternately could we condition arch_haltpoll_want() on
> >> arch_timer_evtstrm_available(), like v7?
> >
> > No. The problem is about the smp_cond_load_relaxed() semantics - it
> > can't wait on a variable that's only updated in its exit condition. We
> > need a new API for this, especially since we are changing generic code
> > here (even it was arm64 code only, I'd still object to such
> > smp_cond_load_*() constructs).
> 
> Right. The problem is that smp_cond_load_relaxed() used in this context
> depends on the event-stream side effect when the interface does not
> encode those semantics anywhere.
> 
> So, a smp_cond_load_timeout() like in [1] that continues to depend on
> the event-stream is better because it explicitly accounts for the side
> effect from the timeout.
> 
> This would cover both the WFxT and the event-stream case.

Indeed.

> The part I'm a little less sure about is the case where WFxT and the
> event-stream are absent.
> 
> As you said earlier, for that case on arm64, we use either short
> __delay() calls or spin in cpu_relax(), both of which are essentially
> the same thing.

Something derived from __delay(), not exactly this function. We can't
use it directly as we also want it to wake up if an event is generated
as a result of a memory write (like the current smp_cond_load()).

> Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends
> it and from my measurement a loop doing "while (!cond) cpu_relax()" gets
> an IPC of something like 0.1 or similar.
> 
> On my arm64 systems however the same loop gets an IPC of 2.  Now this
> likely varies greatly but seems like it would run pretty hot some of
> the time.

For the cpu_relax() fall-back, it wouldn't be any worse than the current
poll_idle() code, though I guess in this instance we'd not enable idle
polling.

I expect the event stream to be on in all production deployments. The
reason we have a way to disable it is for testing. We've had hardware
errata in the past where the event on spin_unlock doesn't cross the
cluster boundary. We'd not notice because of the event stream.

> So maybe the right thing to do would be to keep smp_cond_load_timeout()
> but only allow polling if WFxT or event-stream is enabled. And enhance
> cpuidle_poll_state_init() to fail if the above condition is not met.

We could do this as well. Maybe hide this behind another function like
arch_has_efficient_smp_cond_load_timeout() (well, some shorter name),
checked somewhere in or on the path to cpuidle_poll_state_init(). Well,
it might be simpler to do this in haltpoll_want(), backed by an
arch_haltpoll_want() function.

I assume we want poll_idle() to wake up as soon as a task becomes
available. Otherwise we could have just used udelay() for some fraction
of cpuidle_poll_time() instead of cpu_relax().
Catalin Marinas Oct. 21, 2024, 12:02 p.m. UTC | #22
On Fri, Oct 18, 2024 at 12:00:34PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote:
> >> So maybe the right thing to do would be to keep smp_cond_load_timeout()
> >> but only allow polling if WFxT or event-stream is enabled. And enhance
> >> cpuidle_poll_state_init() to fail if the above condition is not met.
> >
> > We could do this as well. Maybe hide this behind another function like
> > arch_has_efficient_smp_cond_load_timeout() (well, some shorter name),
> > checked somewhere in or on the path to cpuidle_poll_state_init(). Well,
> > it might be simpler to do this in haltpoll_want(), backed by an
> > arch_haltpoll_want() function.
> 
> Yeah, checking in arch_haltpoll_want() would mean that we can leave all
> the cpuidle_poll_state_init() call sites unchanged.
> 
> However, I suspect that even acpi-idle on arm64 might end up using
> poll_idle() (as this patch tries to do:
> https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/).
> 
> So, let me try doing it both ways to see which one is simpler.
> Given that the event stream can be assumed to be always on, it might just
> be more straightforward to fall back to cpu_relax() in that edge case.

I agree, let's go with the simplest. If one has some strong case for
running with the event stream disabled and idle polling becomes too
energy inefficient, we can revisit and add some run-time checks.

Patch

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..fc1204426158 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -21,21 +21,20 @@  static int __cpuidle poll_idle(struct cpuidle_device *dev,
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
 		u64 limit;
 
 		limit = cpuidle_poll_time(drv, dev);
 
 		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
+			unsigned int loop_count = 0;
 			if (local_clock_noinstr() - time_start > limit) {
 				dev->poll_time_limit = true;
 				break;
 			}
+
+			smp_cond_load_relaxed(&current_thread_info()->flags,
+					      VAL & _TIF_NEED_RESCHED ||
+					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
 		}
 	}
 	raw_local_irq_disable();