
[v10,00/11] arm64: support poll_idle()

Message ID 20250218213337.377987-1-ankur.a.arora@oracle.com

Ankur Arora Feb. 18, 2025, 9:33 p.m. UTC
Hi,

This patchset adds support for polling in idle on arm64 via poll_idle()
and adds the requisite support to acpi-idle and cpuidle-haltpoll.

v10 is a respin of v9 with the timed wait barrier logic 
(smp_cond_load_relaxed_timewait()) moved out into a separate
series [0]. (The barrier patches could also do with some eyes.)


Why poll in idle?
==

The benefit of polling in idle is to reduce the cost (and latency)
of remote wakeups. When enabled, a remote wakeup can be done just by
setting the need-resched bit, eliding both the IPI and the cost of
handling the interrupt on the receiver.

Comparing sched-pipe performance on a guest VM:

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions \
   -e sched:sched_wake_idle_without_ipi perf bench sched pipe -l 1000000 --cpu 4

# without polling in idle

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


# polling in idle (with haltpoll):

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                      ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

Comparing the two cases, there's a significant drop in both cycles and
instructions executed, as well as a significant drop in wakeup latency.

Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].
Haris also tested a modified v9 on top of the split out barrier
primitives.

Lifeng also reports improved context switch latency on a bare-metal
machine with acpi-idle [3].


Series layout
==

 - patches 1-3,

    "cpuidle/poll_state: poll via smp_cond_load_relaxed_timewait()"
    "cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
    "Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"

   switch poll_idle() to using the new barrier interface. Also, do some
   munging of related kconfig options.

 - patches 4-5,

    "arm64: define TIF_POLLING_NRFLAG"
    "arm64: add support for poll_idle()"

   add arm64 support for the polling flag and enable poll_idle()
   support.

 - patches 6, 7-11,

    "ACPI: processor_idle: Support polling state for LPI"

    "cpuidle-haltpoll: define arch_haltpoll_want()"
    "governors/haltpoll: drop kvm_para_available() check"
    "cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"

    "arm64: idle: export arch_cpu_idle()"
    "arm64: support cpuidle-haltpoll"

    add support for polling via acpi-idle, and cpuidle-haltpoll.


Changelog
==

v10: respin of v9
   - sent out smp_cond_load_relaxed_timeout() separately [0]
     - Dropped from this series:
        "asm-generic: add barrier smp_cond_load_relaxed_timeout()"
        "arm64: barrier: add support for smp_cond_relaxed_timeout()"
        "arm64/delay: move some constants out to a separate header"
        "arm64: support WFET in smp_cond_relaxed_timeout()"

   - reworded some commit messages

v9:
 - reworked the series to address a comment from Catalin Marinas
   about how v8 was abusing semantics of smp_cond_load_relaxed().
 - add poll_idle() support in acpi-idle (Lifeng Zheng)
 - dropped some earlier "Tested-by", "Reviewed-by" due to the
   above rework.

v8: No logic changes. Largely respin of v7, with changes
noted below:

 - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
   own patch.
   (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
   
 - address comments simplifying arm64 support (Will Deacon)
   (patch-11 "arm64: support cpuidle-haltpoll")

v7: No significant logic changes. Mostly a respin of v6.

 - minor cleanup in poll_idle() (Christoph Lameter)
 - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
   (Tomohiro Misono)

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - fleshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)

v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael's input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Ack-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we replace cpu_relax() with smp_cond_load_relaxed() per PeterZ
   (this improves the CPU cycles consumed in the tests above by at least 50%:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW


Would appreciate any review comments.

Ankur


[0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
[3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/

Ankur Arora (6):
  cpuidle/poll_state: poll via smp_cond_load_relaxed_timewait()
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  arm64: add support for poll_idle()
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle()
  arm64: support cpuidle-haltpoll

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  arm64: define TIF_POLLING_NRFLAG
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check

Lifeng Zheng (1):
  ACPI: processor_idle: Support polling state for LPI

 arch/Kconfig                              |  3 ++
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/include/asm/cpuidle_haltpoll.h | 20 +++++++++++
 arch/arm64/include/asm/thread_info.h      |  2 ++
 arch/arm64/kernel/idle.c                  |  1 +
 arch/x86/Kconfig                          |  5 ++-
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 +++++++
 drivers/acpi/processor_idle.c             | 43 +++++++++++++++++++----
 drivers/cpuidle/Kconfig                   |  5 ++-
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +------
 drivers/cpuidle/governors/haltpoll.c      |  6 +---
 drivers/cpuidle/poll_state.c              | 27 +++++---------
 drivers/idle/Kconfig                      |  1 +
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 +++
 17 files changed, 105 insertions(+), 50 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

Comments

Christoph Lameter (Ampere) Feb. 24, 2025, 4:57 p.m. UTC | #1
On Tue, 18 Feb 2025, Ankur Arora wrote:

> So, we can safely forgo the kvm_para_available() check. This also
> allows cpuidle-haltpoll to be tested on baremetal.

I would hope that we will have this functionality as the default on
baremetal after testing in the future.

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Ankur Arora Feb. 25, 2025, 7:06 p.m. UTC | #2
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Tue, 18 Feb 2025, Ankur Arora wrote:
>
>> So, we can safely forgo the kvm_para_available() check. This also
>> allows cpuidle-haltpoll to be tested on baremetal.
>
> I would hope that we will have this functionality as the default on
> baremetal after testing in the future.

Yeah, supporting haltpoll style adaptive polling on baremetal has some
way to go.

But, with Lifeng's patch-6 "ACPI: processor_idle: Support polling state
for LPI" we do get polling support in acpi-idle.

> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>

Thanks Christoph!

--
ankur
Shuai Xue April 11, 2025, 3:32 a.m. UTC | #3
On 2025/2/19 05:33, Ankur Arora wrote:
> Needed for cpuidle-haltpoll.
> 
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   arch/arm64/kernel/idle.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
> index 05cfb347ec26..b85ba0df9b02 100644
> --- a/arch/arm64/kernel/idle.c
> +++ b/arch/arm64/kernel/idle.c
> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>   	 */
>   	cpu_do_idle();

Hi, Ankur,

With haltpoll_driver registered, arch_cpu_idle() on x86 can select
mwait_idle() in idle threads.

It uses MONITOR to set up an effective address range that is monitored
for write-to-memory activity; MWAIT then places the processor in
an optimized state (which may vary between implementations)
until a write to the monitored address range occurs.

Should arch_cpu_idle() on arm64 also use LDXR/WFE
to avoid the wakeup IPI, like x86 MONITOR/MWAIT?

Thanks.
Shuai
Okanovic, Haris April 11, 2025, 5:42 p.m. UTC | #4
On Fri, 2025-04-11 at 11:32 +0800, Shuai Xue wrote:
> > On 2025/2/19 05:33, Ankur Arora wrote:
> > > > Needed for cpuidle-haltpoll.
> > > > 
> > > > Acked-by: Will Deacon <will@kernel.org>
> > > > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > > ---
> > > >   arch/arm64/kernel/idle.c | 1 +
> > > >   1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
> > > > index 05cfb347ec26..b85ba0df9b02 100644
> > > > --- a/arch/arm64/kernel/idle.c
> > > > +++ b/arch/arm64/kernel/idle.c
> > > > @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
> > > >        */
> > > >       cpu_do_idle();
> > 
> > Hi, Ankur,
> > 
> > With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> > mwait_idle() in idle threads.
> > 
> > It use MONITOR sets up an effective address range that is monitored
> > for write-to-memory activities; MWAIT places the processor in
> > an optimized state (this may vary between different implementations)
> > until a write to the monitored address range occurs.
> > 
> > Should arch_cpu_idle() on arm64 also use the LDXR/WFE
> > to avoid wakeup IPI like x86 monitor/mwait?

WFE will wake from the event stream, which can have short sub-ms
periods on many systems. May be something to consider when WFET is more
widely available.

> > 
> > Thanks.
> > Shuai
> > 
> > 

Regards,
Haris Okanovic
AWS Graviton Software
Ankur Arora April 11, 2025, 8:57 p.m. UTC | #5
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/2/19 05:33, Ankur Arora wrote:
>> Needed for cpuidle-haltpoll.
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   arch/arm64/kernel/idle.c | 1 +
>>   1 file changed, 1 insertion(+)
>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>> index 05cfb347ec26..b85ba0df9b02 100644
>> --- a/arch/arm64/kernel/idle.c
>> +++ b/arch/arm64/kernel/idle.c
>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>   	 */
>>   	cpu_do_idle();
>
> Hi, Ankur,
>
> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> mwait_idle() in idle threads.
>
> It use MONITOR sets up an effective address range that is monitored
> for write-to-memory activities; MWAIT places the processor in
> an optimized state (this may vary between different implementations)
> until a write to the monitored address range occurs.

MWAIT is more capable than WFE -- it allows selection of deeper idle
states. IIRC C2/C3.

> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
> to avoid wakeup IPI like x86 monitor/mwait?

Avoiding the wakeup IPI needs TIF_POLLING_NRFLAG and polling in idle support
that this series adds.

As Haris notes, the negative with only using WFE is that it only allows
a single idle state, one that is fairly shallow because the event-stream
causes a wakeup every 100us.

--
ankur
Shuai Xue April 14, 2025, 2:01 a.m. UTC | #6
On 2025/4/12 04:57, Ankur Arora wrote:
> 
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
> 
>> On 2025/2/19 05:33, Ankur Arora wrote:
>>> Needed for cpuidle-haltpoll.
>>> Acked-by: Will Deacon <will@kernel.org>
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    arch/arm64/kernel/idle.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>> index 05cfb347ec26..b85ba0df9b02 100644
>>> --- a/arch/arm64/kernel/idle.c
>>> +++ b/arch/arm64/kernel/idle.c
>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>    	 */
>>>    	cpu_do_idle();
>>
>> Hi, Ankur,
>>
>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>> mwait_idle() in idle threads.
>>
>> It use MONITOR sets up an effective address range that is monitored
>> for write-to-memory activities; MWAIT places the processor in
>> an optimized state (this may vary between different implementations)
>> until a write to the monitored address range occurs.
> 
> MWAIT is more capable than WFE -- it allows selection of deeper idle
> state. IIRC C2/C3.
> 
>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>> to avoid wakeup IPI like x86 monitor/mwait?
> 
> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
> that this series adds.
> 
> As Haris notes, the negative with only using WFE is that it only allows
> a single idle state, one that is fairly shallow because the event-stream
> causes a wakeup every 100us.
> 
> --
> ankur

Hi, Ankur and Haris

Got it, thanks for the explanation :)

Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

     32521.53 msec task-clock                #    2.000 CPUs utilized            ( +-  1.16% )
  38081402726      cycles                    #    1.171 GHz                      ( +-  1.70% )
  27324614561      instructions              #    0.72  insn per cycle           ( +-  0.12% )
          181      sched:sched_wake_idle_without_ipi #    0.006 K/sec

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

      9477.15 msec task-clock                #    2.000 CPUs utilized            ( +-  0.89% )
  21486828269      cycles                    #    2.267 GHz                      ( +-  0.35% )
  23867109747      instructions              #    1.11  insn per cycle           ( +-  0.11% )
      1925207      sched:sched_wake_idle_without_ipi #    0.203 M/sec

Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

          34,007.89 msec task-clock                       #    2.000 CPUs utilized               ( +-  8.86% )
      4,407,859,620      cycles                           #    0.130 GHz                         ( +- 84.92% )
      2,482,046,461      instructions                     #    0.56  insn per cycle              ( +- 88.27% )
                 16      sched:sched_wake_idle_without_ipi #    0.470 /sec                        ( +- 98.77% )

              17.00 +- 1.51 seconds time elapsed  ( +-  8.86% )

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

          16,894.37 msec task-clock                       #    2.000 CPUs utilized               ( +-  3.80% )
      8,703,158,826      cycles                           #    0.515 GHz                         ( +- 31.31% )
      5,379,257,839      instructions                     #    0.62  insn per cycle              ( +- 30.03% )
            549,434      sched:sched_wake_idle_without_ipi #   32.522 K/sec                       ( +- 30.05% )

              8.447 +- 0.321 seconds time elapsed  ( +-  3.80% )

Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai
Ankur Arora April 14, 2025, 3:46 a.m. UTC | #7
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/12 04:57, Ankur Arora wrote:
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>> Needed for cpuidle-haltpoll.
>>>> Acked-by: Will Deacon <will@kernel.org>
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    arch/arm64/kernel/idle.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>> --- a/arch/arm64/kernel/idle.c
>>>> +++ b/arch/arm64/kernel/idle.c
>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>    	 */
>>>>    	cpu_do_idle();
>>>
>>> Hi, Ankur,
>>>
>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>> mwait_idle() in idle threads.
>>>
>>> It use MONITOR sets up an effective address range that is monitored
>>> for write-to-memory activities; MWAIT places the processor in
>>> an optimized state (this may vary between different implementations)
>>> until a write to the monitored address range occurs.
>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>> state. IIRC C2/C3.
>>
>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>> to avoid wakeup IPI like x86 monitor/mwait?
>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>> that this series adds.
>> As Haris notes, the negative with only using WFE is that it only allows
>> a single idle state, one that is fairly shallow because the event-stream
>> causes a wakeup every 100us.
>> --
>> ankur
>
> Hi, Ankur and Haris
>
> Got it, thanks for explaination :)
>
> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
that much :). The reduced instructions make sense since we don't have to
handle the IRQ anymore but we would spend some of the saved cycles
waiting in WFE instead.

I'm not familiar with the Yitian 710. Can you check if you are running
with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
__smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
Kunpeng 920.

Also, I'm working on a new version of the series in [1]. Would you be
okay trying that out?

Thanks
Ankur

[0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/

> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>     32521.53 msec task-clock                #    2.000 CPUs utilized            ( +-  1.16% )
>  38081402726      cycles                    #    1.171 GHz                      ( +-  1.70% )
>  27324614561      instructions              #    0.72  insn per cycle           ( +-  0.12% )
>          181      sched:sched_wake_idle_without_ipi #    0.006 K/sec
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>      9477.15 msec task-clock                #    2.000 CPUs utilized            ( +-  0.89% )
>  21486828269      cycles                    #    2.267 GHz                      ( +-  0.35% )
>  23867109747      instructions              #    1.11  insn per cycle           ( +-  0.11% )
>      1925207      sched:sched_wake_idle_without_ipi #    0.203 M/sec
>
> Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:
>
> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>          34,007.89 msec task-clock                       #    2.000 CPUs utilized               ( +-  8.86% )
>      4,407,859,620      cycles                           #    0.130 GHz                         ( +- 84.92% )
>      2,482,046,461      instructions                     #    0.56  insn per cycle              ( +- 88.27% )
>                 16      sched:sched_wake_idle_without_ipi #    0.470 /sec                        ( +- 98.77% )
>
>              17.00 +- 1.51 seconds time elapsed  ( +-  8.86% )
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>          16,894.37 msec task-clock                       #    2.000 CPUs utilized               ( +-  3.80% )
>      8,703,158,826      cycles                           #    0.515 GHz                         ( +- 31.31% )
>      5,379,257,839      instructions                     #    0.62  insn per cycle              ( +- 30.03% )
>            549,434      sched:sched_wake_idle_without_ipi #   32.522 K/sec                       ( +- 30.05% )
>
>              8.447 +- 0.321 seconds time elapsed  ( +-  3.80% )
>
> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Thanks.
> Shuai
Shuai Xue April 14, 2025, 7:43 a.m. UTC | #8
On 2025/4/14 11:46, Ankur Arora wrote:
> 
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
> 
>> On 2025/4/12 04:57, Ankur Arora wrote:
>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>
>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>> Needed for cpuidle-haltpoll.
>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> ---
>>>>>     arch/arm64/kernel/idle.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>> --- a/arch/arm64/kernel/idle.c
>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>     	 */
>>>>>     	cpu_do_idle();
>>>>
>>>> Hi, Ankur,
>>>>
>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>> mwait_idle() in idle threads.
>>>>
>>>> It use MONITOR sets up an effective address range that is monitored
>>>> for write-to-memory activities; MWAIT places the processor in
>>>> an optimized state (this may vary between different implementations)
>>>> until a write to the monitored address range occurs.
>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>> state. IIRC C2/C3.
>>>
>>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>>> to avoid wakeup IPI like x86 monitor/mwait?
>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>> that this series adds.
>>> As Haris notes, the negative with only using WFE is that it only allows
>>> a single idle state, one that is fairly shallow because the event-stream
>>> causes a wakeup every 100us.
>>> --
>>> ankur
>>
>> Hi, Ankur and Haris
>>
>> Got it, thanks for explaination :)
>>
>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
> 
> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
> that much :). The reduced instructions make sense since we don't have to
> handle the IRQ anymore but we would spend some of the saved cycles
> waiting in WFE instead.
> 
> I'm not familiar with the Yitian 710. Can you check if you are running
> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
> Kunpeng 920.

Yes, it is running with __smp_cond_load_relaxed_timewait().

I use perf-probe to check if WFE is available in Guest:

perf probe 'arch_timer_evtstrm_available%return r=$retval'
perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
perf script
swapper       0 [000]  1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1

arch_timer_evtstrm_available returns true, so
__smp_cond_load_relaxed_timewait() is used.

> 
> Also, I'm working on a new version of the series in [1]. Would you be
> okay trying that out?

Sure. Please cc me when you send out a new version.

> 
> Thanks
> Ankur
> 
> [0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
> [1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/
> 

Thanks.
Shuai
Ankur Arora April 15, 2025, 6:24 a.m. UTC | #9
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/14 11:46, Ankur Arora wrote:
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/4/12 04:57, Ankur Arora wrote:
>>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>>
>>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>>> Needed for cpuidle-haltpoll.
>>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>>> ---
>>>>>>     arch/arm64/kernel/idle.c | 1 +
>>>>>>     1 file changed, 1 insertion(+)
>>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>>> --- a/arch/arm64/kernel/idle.c
>>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>>     	 */
>>>>>>     	cpu_do_idle();
>>>>>
>>>>> Hi, Ankur,
>>>>>
>>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>>> mwait_idle() in idle threads.
>>>>>
>>>>> It use MONITOR sets up an effective address range that is monitored
>>>>> for write-to-memory activities; MWAIT places the processor in
>>>>> an optimized state (this may vary between different implementations)
>>>>> until a write to the monitored address range occurs.
>>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>>> state. IIRC C2/C3.
>>>>
>>>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>>>> to avoid wakeup IPI like x86 monitor/mwait?
>>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>>> that this series adds.
>>>> As Haris notes, the negative with only using WFE is that it only allows
>>>> a single idle state, one that is fairly shallow because the event-stream
>>>> causes a wakeup every 100us.
>>>> --
>>>> ankur
>>>
>>> Hi, Ankur and Haris
>>>
>>> Got it, thanks for explaination :)
>>>
>>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
>> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
>> that much :). The reduced instructions make sense since we don't have to
>> handle the IRQ anymore but we would spend some of the saved cycles
>> waiting in WFE instead.
>> I'm not familiar with the Yitian 710. Can you check if you are running
>> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
>> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
>> Kunpeng 920.
>
> Yes, it running with __smp_cond_load_relaxed_timewait().
>
> I use perf-probe to check if WFE is available in Guest:
>
> perf probe 'arch_timer_evtstrm_available%return r=$retval'
> perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
> perf script
> swapper       0 [000]  1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1
>
> arch_timer_evtstrm_available returns true, so
> __smp_cond_load_relaxed_timewait() is used.

Great. Thanks for checking.

>> Also, I'm working on a new version of the series in [1]. Would you be
>> okay trying that out?
>
> Sure. Please cc me when you send out a new version.

Will do. Thanks!

--
ankur