Message ID | 20250218213337.377987-1-ankur.a.arora@oracle.com |
---|---|
Series | arm64: support poll_idle() |
On Tue, 18 Feb 2025, Ankur Arora wrote:

> So, we can safely forgo the kvm_para_available() check. This also
> allows cpuidle-haltpoll to be tested on baremetal.

I would hope that we will have this functionality as the default on
baremetal after testing in the future.

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Tue, 18 Feb 2025, Ankur Arora wrote:
>
>> So, we can safely forgo the kvm_para_available() check. This also
>> allows cpuidle-haltpoll to be tested on baremetal.
>
> I would hope that we will have this functionality as the default on
> baremetal after testing in the future.

Yeah, supporting haltpoll style adaptive polling on baremetal has some
way to go. But, with Lifeng's patch-6 "ACPI: processor_idle: Support
polling state for LPI" we do get polling support in acpi-idle.

> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>

Thanks Christoph!

--
ankur
On 2025/2/19 05:33, Ankur Arora wrote:
> Needed for cpuidle-haltpoll.
>
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/kernel/idle.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
> index 05cfb347ec26..b85ba0df9b02 100644
> --- a/arch/arm64/kernel/idle.c
> +++ b/arch/arm64/kernel/idle.c
> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>  	 */
>  	cpu_do_idle();

Hi, Ankur,

With haltpoll_driver registered, arch_cpu_idle() on x86 can select
mwait_idle() in idle threads.

There, MONITOR sets up an effective address range that is monitored for
write-to-memory activity, and MWAIT places the processor in an optimized
state (this may vary between implementations) until a write to the
monitored address range occurs.

Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
IPI, like x86 MONITOR/MWAIT?

Thanks.
Shuai
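The LDXR/WFE idiom referenced above is essentially what the arm64
__cmpwait() helper behind smp_cond_load_relaxed() does. The sketch below
is simplified from the pattern in arch/arm64/include/asm/cmpxchg.h;
cmpwait_sketch() is an illustrative name, not a kernel symbol:

    /* Wait for *ptr to move away from val. LDXR arms the exclusive
     * monitor for the cacheline; WFE then sleeps until a write to that
     * line (or another event, e.g. the event stream) is signalled. */
    static inline void cmpwait_sketch(volatile unsigned long *ptr,
                                      unsigned long val)
    {
            unsigned long tmp;

            asm volatile(
            "       sevl\n"                    /* post a local event... */
            "       wfe\n"                     /* ...so this wfe falls through */
            "       ldxr    %[tmp], %[v]\n"    /* load-exclusive arms the monitor */
            "       eor     %[tmp], %[tmp], %[val]\n"
            "       cbnz    %[tmp], 1f\n"      /* value already changed: done */
            "       wfe\n"                     /* sleep until the line is written */
            "1:"
            : [tmp] "=&r" (tmp), [v] "+Q" (*ptr)
            : [val] "r" (val));
    }

A waker that stores to *ptr clears the exclusive monitor, completing the
final WFE without an IPI -- the property the haltpoll discussion below
relies on.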
On Fri, 2025-04-11 at 11:32 +0800, Shuai Xue wrote:
> On 2025/2/19 05:33, Ankur Arora wrote:
>> Needed for cpuidle-haltpoll.
>>
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/arm64/kernel/idle.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>> index 05cfb347ec26..b85ba0df9b02 100644
>> --- a/arch/arm64/kernel/idle.c
>> +++ b/arch/arm64/kernel/idle.c
>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>  	 */
>>  	cpu_do_idle();
>
> Hi, Ankur,
>
> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> mwait_idle() in idle threads.
>
> There, MONITOR sets up an effective address range that is monitored for
> write-to-memory activity, and MWAIT places the processor in an optimized
> state (this may vary between implementations) until a write to the
> monitored address range occurs.
>
> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
> IPI, like x86 MONITOR/MWAIT?

WFE will wake from the event stream, which can have short sub-ms periods
on many systems. May be something to consider when WFET is more widely
available.

> Thanks.
> Shuai

Regards,
Haris Okanovic
AWS Graviton Software
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/2/19 05:33, Ankur Arora wrote:
>> Needed for cpuidle-haltpoll.
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/arm64/kernel/idle.c | 1 +
>>  1 file changed, 1 insertion(+)
>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>> index 05cfb347ec26..b85ba0df9b02 100644
>> --- a/arch/arm64/kernel/idle.c
>> +++ b/arch/arm64/kernel/idle.c
>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>  	 */
>>  	cpu_do_idle();
>
> Hi, Ankur,
>
> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> mwait_idle() in idle threads.
>
> There, MONITOR sets up an effective address range that is monitored for
> write-to-memory activity, and MWAIT places the processor in an optimized
> state (this may vary between implementations) until a write to the
> monitored address range occurs.

MWAIT is more capable than WFE -- it allows selection of deeper idle
states, IIRC C2/C3.

> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
> IPI, like x86 MONITOR/MWAIT?

Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
that this series adds.

As Haris notes, the negative with only using WFE is that it only allows
a single idle state, one that is fairly shallow because the event-stream
causes a wakeup every 100us.

--
ankur
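The polling contract Ankur describes works roughly as below. This is a
hedged sketch loosely following kernel/sched/idle.c and
kernel/sched/core.c: the helpers current_set_polling_and_test(),
current_clr_polling(), need_resched(), set_nr_and_not_polling() and
smp_send_reschedule() are real kernel symbols, but poll_idle_sketch()
and resched_sketch() are simplified illustrations, not the actual
implementations:

    #include <linux/sched.h>        /* need_resched(), struct task_struct */
    #include <linux/sched/idle.h>   /* current_set_polling_and_test() */

    /* Idle side: advertise that this CPU polls its thread flags. */
    static void poll_idle_sketch(void)
    {
            /* Sets TIF_POLLING_NRFLAG; returns true if a resched is
             * already pending. */
            if (!current_set_polling_and_test()) {
                    while (!need_resched())
                            cpu_relax();    /* x86 spins; arm64 can WFE via
                                             * smp_cond_load_relaxed() */
            }
            current_clr_polling();
    }

    /* Waker side (cf. resched_curr()): if the remote idle task advertised
     * polling, setting TIF_NEED_RESCHED is enough -- it will notice on
     * its next poll. Only a non-polling target needs the resched IPI. */
    static void resched_sketch(struct task_struct *idle_task, int cpu)
    {
            if (set_nr_and_not_polling(idle_task))
                    smp_send_reschedule(cpu);
    }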
On 2025/4/12 04:57, Ankur Arora wrote:
>
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>
>> On 2025/2/19 05:33, Ankur Arora wrote:
>>> Needed for cpuidle-haltpoll.
>>> Acked-by: Will Deacon <will@kernel.org>
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>  arch/arm64/kernel/idle.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>> index 05cfb347ec26..b85ba0df9b02 100644
>>> --- a/arch/arm64/kernel/idle.c
>>> +++ b/arch/arm64/kernel/idle.c
>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>  	 */
>>>  	cpu_do_idle();
>>
>> Hi, Ankur,
>>
>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>> mwait_idle() in idle threads.
>>
>> There, MONITOR sets up an effective address range that is monitored for
>> write-to-memory activity, and MWAIT places the processor in an optimized
>> state (this may vary between implementations) until a write to the
>> monitored address range occurs.
>
> MWAIT is more capable than WFE -- it allows selection of deeper idle
> states, IIRC C2/C3.
>
>> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
>> IPI, like x86 MONITOR/MWAIT?
>
> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
> that this series adds.
>
> As Haris notes, the negative with only using WFE is that it only allows
> a single idle state, one that is fairly shallow because the event-stream
> causes a wakeup every 100us.
>
> --
> ankur

Hi, Ankur and Haris,

Got it, thanks for the explanation :)

Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

        32521.53 msec task-clock                  #  2.000 CPUs utilized   ( +- 1.16% )
     38081402726      cycles                      #  1.171 GHz             ( +- 1.70% )
     27324614561      instructions                #  0.72 insn per cycle   ( +- 0.12% )
             181      sched:sched_wake_idle_without_ipi #  0.006 K/sec

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

         9477.15 msec task-clock                  #  2.000 CPUs utilized   ( +- 0.89% )
     21486828269      cycles                      #  2.267 GHz             ( +- 0.35% )
     23867109747      instructions                #  1.11 insn per cycle   ( +- 0.11% )
         1925207      sched:sched_wake_idle_without_ipi #  0.203 M/sec

Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

       34,007.89 msec task-clock                  #  2.000 CPUs utilized   ( +- 8.86% )
   4,407,859,620      cycles                      #  0.130 GHz             ( +- 84.92% )
   2,482,046,461      instructions                #  0.56 insn per cycle   ( +- 88.27% )
              16      sched:sched_wake_idle_without_ipi #  0.470 /sec     ( +- 98.77% )

           17.00 +- 1.51 seconds time elapsed ( +- 8.86% )

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

       16,894.37 msec task-clock                  #  2.000 CPUs utilized   ( +- 3.80% )
   8,703,158,826      cycles                      #  0.515 GHz             ( +- 31.31% )
   5,379,257,839      instructions                #  0.62 insn per cycle   ( +- 30.03% )
         549,434      sched:sched_wake_idle_without_ipi #  32.522 K/sec   ( +- 30.05% )

           8.447 +- 0.321 seconds time elapsed ( +- 3.80% )

Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/12 04:57, Ankur Arora wrote:
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>> Needed for cpuidle-haltpoll.
>>>> Acked-by: Will Deacon <will@kernel.org>
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>  arch/arm64/kernel/idle.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>> --- a/arch/arm64/kernel/idle.c
>>>> +++ b/arch/arm64/kernel/idle.c
>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>  	 */
>>>>  	cpu_do_idle();
>>>
>>> Hi, Ankur,
>>>
>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>> mwait_idle() in idle threads.
>>>
>>> There, MONITOR sets up an effective address range that is monitored for
>>> write-to-memory activity, and MWAIT places the processor in an optimized
>>> state (this may vary between implementations) until a write to the
>>> monitored address range occurs.
>>
>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>> states, IIRC C2/C3.
>>
>>> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
>>> IPI, like x86 MONITOR/MWAIT?
>>
>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>> that this series adds.
>> As Haris notes, the negative with only using WFE is that it only allows
>> a single idle state, one that is fairly shallow because the event-stream
>> causes a wakeup every 100us.
>> --
>> ankur
>
> Hi, Ankur and Haris,
>
> Got it, thanks for the explanation :)
>
> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
that much :). The reduced instructions make sense since we don't have to
handle the IRQ anymore but we would spend some of the saved cycles
waiting in WFE instead.

I'm not familiar with the Yitian 710. Can you check if you are running
with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
__smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
Kunpeng 920.

Also, I'm working on a new version of the series in [1]. Would you be
okay trying that out?

Thanks
Ankur

[0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/

> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>         32521.53 msec task-clock                  #  2.000 CPUs utilized   ( +- 1.16% )
>      38081402726      cycles                      #  1.171 GHz             ( +- 1.70% )
>      27324614561      instructions                #  0.72 insn per cycle   ( +- 0.12% )
>              181      sched:sched_wake_idle_without_ipi #  0.006 K/sec
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>          9477.15 msec task-clock                  #  2.000 CPUs utilized   ( +- 0.89% )
>      21486828269      cycles                      #  2.267 GHz             ( +- 0.35% )
>      23867109747      instructions                #  1.11 insn per cycle   ( +- 0.11% )
>          1925207      sched:sched_wake_idle_without_ipi #  0.203 M/sec
>
> Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:
>
> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>        34,007.89 msec task-clock                  #  2.000 CPUs utilized   ( +- 8.86% )
>    4,407,859,620      cycles                      #  0.130 GHz             ( +- 84.92% )
>    2,482,046,461      instructions                #  0.56 insn per cycle   ( +- 88.27% )
>               16      sched:sched_wake_idle_without_ipi #  0.470 /sec     ( +- 98.77% )
>
>            17.00 +- 1.51 seconds time elapsed ( +- 8.86% )
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>        16,894.37 msec task-clock                  #  2.000 CPUs utilized   ( +- 3.80% )
>    8,703,158,826      cycles                      #  0.515 GHz             ( +- 31.31% )
>    5,379,257,839      instructions                #  0.62 insn per cycle   ( +- 30.03% )
>          549,434      sched:sched_wake_idle_without_ipi #  32.522 K/sec   ( +- 30.05% )
>
>            8.447 +- 0.321 seconds time elapsed ( +- 3.80% )
>
> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Thanks.
> Shuai
On 2025/4/14 11:46, Ankur Arora wrote:
>
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>
>> On 2025/4/12 04:57, Ankur Arora wrote:
>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>
>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>> Needed for cpuidle-haltpoll.
>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> ---
>>>>>  arch/arm64/kernel/idle.c | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>> --- a/arch/arm64/kernel/idle.c
>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>  	 */
>>>>>  	cpu_do_idle();
>>>>
>>>> Hi, Ankur,
>>>>
>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>> mwait_idle() in idle threads.
>>>>
>>>> There, MONITOR sets up an effective address range that is monitored for
>>>> write-to-memory activity, and MWAIT places the processor in an optimized
>>>> state (this may vary between implementations) until a write to the
>>>> monitored address range occurs.
>>>
>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>> states, IIRC C2/C3.
>>>
>>>> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
>>>> IPI, like x86 MONITOR/MWAIT?
>>>
>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>> that this series adds.
>>> As Haris notes, the negative with only using WFE is that it only allows
>>> a single idle state, one that is fairly shallow because the event-stream
>>> causes a wakeup every 100us.
>>> --
>>> ankur
>>
>> Hi, Ankur and Haris,
>>
>> Got it, thanks for the explanation :)
>>
>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
>
> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
> that much :). The reduced instructions make sense since we don't have to
> handle the IRQ anymore but we would spend some of the saved cycles
> waiting in WFE instead.
>
> I'm not familiar with the Yitian 710. Can you check if you are running
> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
> Kunpeng 920.

Yes, it is running with __smp_cond_load_relaxed_timewait().

I used perf-probe to check whether WFE is available in the guest:

    perf probe 'arch_timer_evtstrm_available%return r=$retval'
    perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
    perf script
        swapper 0 [000] 1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1

arch_timer_evtstrm_available() returns true, so
__smp_cond_load_relaxed_timewait() is used.

> Also, I'm working on a new version of the series in [1]. Would you be
> okay trying that out?

Sure. Please cc me when you send out a new version.

> Thanks
> Ankur
>
> [0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
> [1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/

Thanks.
Shuai
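The arch_timer_evtstrm_available() check matters because the event
stream is what bounds a WFE-based wait: with it on, WFE wakes at least
once per event-stream period (~100us above) even if nothing writes the
monitored line; without it, the code must fall back to spinning. A
hedged illustration of that choice -- cond_load_sketch() is invented for
this note and is not the series' actual
__smp_cond_load_relaxed_timewait():

    #include <linux/atomic.h>               /* smp_cond_load_relaxed() */
    #include <clocksource/arm_arch_timer.h> /* arch_timer_evtstrm_available() */

    /* Wait until (*ptr & mask) == want, sleeping in WFE only when the
     * event stream guarantees a bounded wakeup. */
    static inline u64 cond_load_sketch(u64 *ptr, u64 mask, u64 want)
    {
            u64 val;

            if (arch_timer_evtstrm_available()) {
                    /* LDXR/WFE path: wake on a write to the line or on
                     * the next event-stream tick. */
                    val = smp_cond_load_relaxed(ptr, (VAL & mask) == want);
            } else {
                    /* No event stream: WFE could sleep unboundedly, so
                     * fall back to plain polling. */
                    while (((val = READ_ONCE(*ptr)) & mask) != want)
                            cpu_relax();
            }
            return val;
    }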
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/14 11:46, Ankur Arora wrote:
>>
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/4/12 04:57, Ankur Arora wrote:
>>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>>
>>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>>> Needed for cpuidle-haltpoll.
>>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>>> ---
>>>>>>  arch/arm64/kernel/idle.c | 1 +
>>>>>>  1 file changed, 1 insertion(+)
>>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>>> --- a/arch/arm64/kernel/idle.c
>>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>>  	 */
>>>>>>  	cpu_do_idle();
>>>>>
>>>>> Hi, Ankur,
>>>>>
>>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>>> mwait_idle() in idle threads.
>>>>>
>>>>> There, MONITOR sets up an effective address range that is monitored for
>>>>> write-to-memory activity, and MWAIT places the processor in an optimized
>>>>> state (this may vary between implementations) until a write to the
>>>>> monitored address range occurs.
>>>>
>>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>>> states, IIRC C2/C3.
>>>>
>>>>> Should arch_cpu_idle() on arm64 also use LDXR/WFE to avoid the wakeup
>>>>> IPI, like x86 MONITOR/MWAIT?
>>>>
>>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>>> that this series adds.
>>>> As Haris notes, the negative with only using WFE is that it only allows
>>>> a single idle state, one that is fairly shallow because the event-stream
>>>> causes a wakeup every 100us.
>>>> --
>>>> ankur
>>>
>>> Hi, Ankur and Haris,
>>>
>>> Got it, thanks for the explanation :)
>>>
>>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
>>
>> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
>> that much :). The reduced instructions make sense since we don't have to
>> handle the IRQ anymore but we would spend some of the saved cycles
>> waiting in WFE instead.
>>
>> I'm not familiar with the Yitian 710. Can you check if you are running
>> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
>> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
>> Kunpeng 920.
>
> Yes, it is running with __smp_cond_load_relaxed_timewait().
>
> I used perf-probe to check whether WFE is available in the guest:
>
>     perf probe 'arch_timer_evtstrm_available%return r=$retval'
>     perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
>     perf script
>         swapper 0 [000] 1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1
>
> arch_timer_evtstrm_available() returns true, so
> __smp_cond_load_relaxed_timewait() is used.

Great. Thanks for checking.

>> Also, I'm working on a new version of the series in [1]. Would you be
>> okay trying that out?
>
> Sure. Please cc me when you send out a new version.

Will do. Thanks!

--
ankur