mbox series

[V2,0/3] cpufreq: cppc: Add support for frequency invariance

Message ID cover.1623825725.git.viresh.kumar@linaro.org
Headers show
Series cpufreq: cppc: Add support for frequency invariance | expand

Message

Viresh Kumar June 16, 2021, 6:48 a.m. UTC
Hello,

Changes since V1:

- Few of the patches migrating users to ->exit() callback are posted separately.

- The CPPC patch was completely reverted and so the support for FIE is again
  added here from scratch.

- The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
  only ever called for a CPU if start_cpu() was called for it earlier.

- A new patch to implement RCU locking in arch_topology core to avoid some
  races.

- Some cleanup and very clear/separate paths for FIE in cppc driver now.


-------------------------8<-------------------------

CPPC cpufreq driver is used for ARM servers and this patch series tries to
provide counter-based frequency invariance support for them in the absence for
architecture specific counters (like AMUs).

This was reverted earlier for the 5.13 kernel after Qian Cai reported kernel
oops during suspend/resume.

This is based of pm/linux-next + a cleanup patchset:

https://lore.kernel.org/linux-pm/cover.1623825358.git.viresh.kumar@linaro.org/

All the patches are pushed here together for people to run.

https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc

This is tested on my Hikey platform (without the actual read/write to
performance counters), with this script for over an hour:

while true; do
    for i in `seq 1 7`;
    do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`;
    do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done


Vincent will be giving this patchset a try on ThunderX2.

Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
but lets see how it goes.

Thanks.

--
Viresh

Viresh Kumar (3):
  cpufreq: Add start_cpu() and stop_cpu() callbacks
  arch_topology: Avoid use-after-free for scale_freq_data
  cpufreq: CPPC: Add support for frequency invariance

 Documentation/cpu-freq/cpu-drivers.rst |   7 +-
 drivers/base/arch_topology.c           |  27 ++-
 drivers/cpufreq/Kconfig.arm            |  10 ++
 drivers/cpufreq/cppc_cpufreq.c         | 232 +++++++++++++++++++++++--
 drivers/cpufreq/cpufreq.c              |  19 +-
 include/linux/arch_topology.h          |   1 +
 include/linux/cpufreq.h                |   5 +-
 kernel/sched/core.c                    |   1 +
 8 files changed, 272 insertions(+), 30 deletions(-)

Comments

Vincent Guittot June 16, 2021, 10:02 a.m. UTC | #1
On Wed, 16 Jun 2021 at 08:48, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> Hello,
>
> Changes since V1:
>
> - Few of the patches migrating users to ->exit() callback are posted separately.
>
> - The CPPC patch was completely reverted and so the support for FIE is again
>   added here from scratch.
>
> - The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
>   only ever called for a CPU if start_cpu() was called for it earlier.
>
> - A new patch to implement RCU locking in arch_topology core to avoid some
>   races.
>
> - Some cleanup and very clear/separate paths for FIE in cppc driver now.
>
>
> -------------------------8<-------------------------
>
> CPPC cpufreq driver is used for ARM servers and this patch series tries to
> provide counter-based frequency invariance support for them in the absence for
> architecture specific counters (like AMUs).
>
> This was reverted earlier for the 5.13 kernel after Qian Cai reported kernel
> oops during suspend/resume.
>
> This is based of pm/linux-next + a cleanup patchset:
>
> https://lore.kernel.org/linux-pm/cover.1623825358.git.viresh.kumar@linaro.org/
>
> All the patches are pushed here together for people to run.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc
>
> This is tested on my Hikey platform (without the actual read/write to
> performance counters), with this script for over an hour:
>
> while true; do
>     for i in `seq 1 7`;
>     do
>         echo 0 > /sys/devices/system/cpu/cpu$i/online;
>     done;
>
>     for i in `seq 1 7`;
>     do
>         echo 1 > /sys/devices/system/cpu/cpu$i/online;
>     done;
> done
>
>
> Vincent will be giving this patchset a try on ThunderX2.

I tested your branch and got the following while booting:

[   24.454543] zswap: loaded using pool lzo/zbud
[   24.454753] pstore: Using crash dump compression: deflate
[   24.454776] AppArmor: AppArmor sha1 policy hashing enabled
[   24.454784] ima: No TPM chip found, activating TPM-bypass!
[   24.454789] ima: Allocated hash algorithm: sha256
[   24.454801] ima: No architecture policies found
[   24.455750] pcieport 0000:0f:00.0: Adding to iommu group 0
[   24.893888] ------------[ cut here ]------------
[   24.893891] WARNING: CPU: 95 PID: 1442 at
drivers/cpufreq/cppc_cpufreq.c:123 cppc_scale_freq_workfn+0xc8/0xf8
[   24.893901] Modules linked in:
[   24.893906] CPU: 95 PID: 1442 Comm: cppc_fie Not tainted 5.13.0-rc6+ #359
[   24.893910] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS
0ACKL026 03/19/2019
[   24.893912] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[   24.893915] pc : cppc_scale_freq_workfn+0xc8/0xf8
[   24.893918] lr : cppc_scale_freq_workfn+0x5c/0xf8
[   24.893921] sp : ffff80003727bd90
[   24.893922] x29: ffff80003727bd90 x28: 0000000000000000 x27: ffff800010ec2000
[   24.893928] x26: ffff800010ec2000 x25: ffff8000107c3d90 x24: 0000000000000001
[   24.893932] x23: ffff000816244880 x22: ffff8000113f9000 x21: ffff009f825a0a80
[   24.893935] x20: ffff009efc394220 x19: ffff800011199000 x18: 000000000000001b
[   24.893939] x17: 0000000000000007 x16: 0000000000000001 x15: 00000000000000bf
[   24.893943] x14: 0000000000000016 x13: 000000000000029b x12: 0000000000000016
[   24.893946] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff009efc6958c0
[   24.893950] x8 : ffff009efc394248 x7 : 0000000002bde780 x6 : 00000000ffffffff
[   24.893954] x5 : 00000000916e502a x4 : 00000000d9730e80 x3 : ffffffffffffffff
[   24.893958] x2 : 00000000001e8480 x1 : 00000000002625a0 x0 : 0000000000000401
[   24.893962] Call trace:
[   24.893964]  cppc_scale_freq_workfn+0xc8/0xf8
[   24.893967]  kthread_worker_fn+0x110/0x318
[   24.893971]  kthread+0xf4/0x120
[   24.893973]  ret_from_fork+0x10/0x18
[   24.893977] ---[ end trace ea6dbaf832bce3e4 ]---


>
> Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
> but lets see how it goes.
>
> Thanks.
>
> --
> Viresh
>
> Viresh Kumar (3):
>   cpufreq: Add start_cpu() and stop_cpu() callbacks
>   arch_topology: Avoid use-after-free for scale_freq_data
>   cpufreq: CPPC: Add support for frequency invariance
>
>  Documentation/cpu-freq/cpu-drivers.rst |   7 +-
>  drivers/base/arch_topology.c           |  27 ++-
>  drivers/cpufreq/Kconfig.arm            |  10 ++
>  drivers/cpufreq/cppc_cpufreq.c         | 232 +++++++++++++++++++++++--
>  drivers/cpufreq/cpufreq.c              |  19 +-
>  include/linux/arch_topology.h          |   1 +
>  include/linux/cpufreq.h                |   5 +-
>  kernel/sched/core.c                    |   1 +
>  8 files changed, 272 insertions(+), 30 deletions(-)
>
> --
> 2.31.1.272.g89b43f80a514
>
Viresh Kumar June 17, 2021, 3:24 a.m. UTC | #2
On 16-06-21, 13:48, Ionela Voinescu wrote:
> I was looking forward to the complete removal of stop_cpu() :).


No one wants to remove more code than I do :)

> I'll only comment on this for now as I should know the rest.

> 

> Let's assume we don't have these, what happens now is the following:

> 

> 1. We hotplug out the last CPU in a policy, we call the

>    .stop_cpu()/exit() function which will free the cppc_cpudata structure.

> 

>    The only vulnerability is if we have a last tick on that last CPU,

>    after the above callback was called.

> 

> 2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is

>    stale.

> 

> We do not have a problem when removing the CPPC cpufreq module as we're

> doing cppc_freq_invariance_exit() before unregistering the driver and

> freeing the data.

> 

> Are 1. and 2 the only problems we have, or have I missed any?


There is more to it. For simplicity, lets assume a quad-core setup,
with all 4 CPUs sharing the cpufreq policy. And here is what happens
without the new changes:

- On CPPC cpufreq driver insertion, we start 4 kthreads/irq-works (1
  for each CPU as it fires from tick) from the ->init() callback.

- Now lets say we offline CPU3. The CPPC cpufreq driver isn't notified
  by anyone and it hasn't registered itself to hotplug notifier as
  well. So, the irq-work/kthread isn't stopped. This results in the
  issue reported by Qian earlier.

  The very same thing happens with schedutil governor too, which uses
  very similar mechanisms, and the cpufreq core takes care of it there
  by stopping the governor before removing the CPU from policy->cpus
  and starting it again. So there we stop irq-work/kthread for all 4
  CPUs, then start them only for remaining 3.

  I thought about that approach as well, but it was too heavy to stop
  everyone and start again in this case. And so I invented start_cpu()
  and stop_cpu() callbacks.

- In this case, because the CPU is going away, we need to make sure we
  don't queue any more irq-work or kthread to it and this is one of
  the main reasons for adding synchronization in the topology core,
  because we need a hard guarantee here that irq-work won't fire
  again, as the CPU won't be there or will not be in a sane state.

- The same sequence of events is true for the case where the last CPU
  of a policy goes away (not in this example, but lets say quad-core
  setup with separate policies for each CPU).

- Not just the policy, but the CPU may be going away as well.

I hope I was able to clarify a bit here.

-- 
viresh
Ionela Voinescu June 17, 2021, 10:34 a.m. UTC | #3
Many thanks for the details!

On Thursday 17 Jun 2021 at 08:54:16 (+0530), Viresh Kumar wrote:
> On 16-06-21, 13:48, Ionela Voinescu wrote:
> > I was looking forward to the complete removal of stop_cpu() :).
> 
> No one wants to remove more code than I do :)
> 
> > I'll only comment on this for now as I should know the rest.
> > 
> > Let's assume we don't have these, what happens now is the following:
> > 
> > 1. We hotplug out the last CPU in a policy, we call the
> >    .stop_cpu()/exit() function which will free the cppc_cpudata structure.
> > 
> >    The only vulnerability is if we have a last tick on that last CPU,
> >    after the above callback was called.
> > 
> > 2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is
> >    stale.
> > 
> > We do not have a problem when removing the CPPC cpufreq module as we're
> > doing cppc_freq_invariance_exit() before unregistering the driver and
> > freeing the data.
> > 
> > Are 1. and 2 the only problems we have, or have I missed any?
> 
> There is more to it. For simplicity, lets assume a quad-core setup,
> with all 4 CPUs sharing the cpufreq policy. And here is what happens
> without the new changes:
> 
> - On CPPC cpufreq driver insertion, we start 4 kthreads/irq-works (1
>   for each CPU as it fires from tick) from the ->init() callback.
> 
> - Now lets say we offline CPU3. The CPPC cpufreq driver isn't notified
>   by anyone and it hasn't registered itself to hotplug notifier as
>   well. So, the irq-work/kthread isn't stopped. This results in the
>   issue reported by Qian earlier.
> 

I might be missing something, but when you offline a single CPU in a
policy, the worse that can happen is that a last call to
cppc_scale_freq_tick() would have sneaked in before irqs and the tick
are disabled. But even if we have a last call to
cppc_scale_freq_workfn(), the counter read methods would know how to
cope with hotplug, and the cppc_cpudata structure would still be
allocated and have valid desired_perf and highest_perf values.

Worse case, the last scale factor set for the CPU will be meaningless,
but it's already meaningless as the CPU is going down.

When you are referring to the issue reported by Qian I suppose you are
referring to this [1]. I think this is the case where you hotplug the
last CPU in a policy and free cppc_cpudata.

[1] https://lore.kernel.org/linux-pm/41f5195e-0e5f-fdfe-ba37-34e1fd8e4064@quicinc.com/

>   The very same thing happens with schedutil governor too, which uses
>   very similar mechanisms, and the cpufreq core takes care of it there
>   by stopping the governor before removing the CPU from policy->cpus
>   and starting it again. So there we stop irq-work/kthread for all 4
>   CPUs, then start them only for remaining 3.
> 

Things are different for sugov: you have to stop the governor when one
CPU in the policy goes down, and it's normal for sugov not to allow its
hooks to be called while the governor is down so it has to do a full
cleanup when going down and a full bringup when going back up.

The difference for CPPC invariance is that only a CPU can trigger the
work to update its own scale factor, through the tick. No other CPU x
can trigger a scale factor update for CPU y, but x can carry out the
work for CPU y (x can run the cppc_scale_freq_workfn(y)).

So when y goes down, it won't have a tick to queue any irq or kthread
work any longer until it's brought back up. So I believe that the only
cleanup that needs to be done when a CPU goes offline, is to ensure
that the work triggered by that last tick is done safely.

>   I thought about that approach as well, but it was too heavy to stop
>   everyone and start again in this case. And so I invented start_cpu()
>   and stop_cpu() callbacks.
> 

> - In this case, because the CPU is going away, we need to make sure we
>   don't queue any more irq-work or kthread to it and this is one of
>   the main reasons for adding synchronization in the topology core,
>   because we need a hard guarantee here that irq-work won't fire
>   again, as the CPU won't be there or will not be in a sane state.
> 

We can't queue any more work for it as there's no tick for the offline
CPU.

> - The same sequence of events is true for the case where the last CPU
>   of a policy goes away (not in this example, but lets say quad-core
>   setup with separate policies for each CPU).
> 
> - Not just the policy, but the CPU may be going away as well.
> 
> I hope I was able to clarify a bit here.
> 

Thanks! I do agree it is better to be cautious, but I initially wanted to
understand we don't see the problem bigger than it actually is.

Thanks,
Ionela.

P.S. I'll give more thought to the rcu use in the arch_topology driver.
I'm the boring kind that likes to err on the side of caution, so I tend
to agree that it might be good to have even if the current problem could
be solved in this driver.


> -- 
> viresh
Peter Zijlstra June 17, 2021, 12:22 p.m. UTC | #4
On Thu, Jun 17, 2021 at 04:49:36PM +0530, Viresh Kumar wrote:
> Peter: Can you help here on this ? Lemme try to explain a bit here:

> 

> We are starting an irq-work (in cppc cpufreq driver) from

> scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't

> take care of CPU hotplug explicitly and make sure this work isn't queued again

> from the next tick.

> 

> Is it important for user to make sure it gets rid of the irq-work during hotplug

> here ?


irq-work is flushed on hotplug, see smpcfd_dying_cpu().
Viresh Kumar June 18, 2021, 3:45 a.m. UTC | #5
On 17-06-21, 14:22, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 04:49:36PM +0530, Viresh Kumar wrote:

> > Peter: Can you help here on this ? Lemme try to explain a bit here:

> > 

> > We are starting an irq-work (in cppc cpufreq driver) from

> > scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't

> > take care of CPU hotplug explicitly and make sure this work isn't queued again

> > from the next tick.

> > 

> > Is it important for user to make sure it gets rid of the irq-work during hotplug

> > here ?

> 

> irq-work is flushed on hotplug, see smpcfd_dying_cpu().


Thanks Peter.

-- 
viresh
Viresh Kumar June 18, 2021, 7:37 a.m. UTC | #6
On 17-06-21, 11:34, Ionela Voinescu wrote:
> We can't queue any more work for it as there's no tick for the offline
> CPU.

Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be
a problem and we can avoid handling that here. Good.

The patch 1/3 from this series is not required anymore. I will get rid of
stop_cpu() as well.

The second patch stays as is, as I still don't see another way of fixing the
same problem on policy ->exit(). We still need that guarantee from topology
core.

> P.S. I'll give more thought to the rcu use in the arch_topology driver.
> I'm the boring kind that likes to err on the side of caution, so I tend
> to agree that it might be good to have even if the current problem could
> be solved in this driver.

Before I resend the series again, maybe it is better to align on the idea here
itself first (as I need to fix some existing potential memleaks in policy
->init() first). So here is the new diff, looks okay now ?

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..3e9070f107c5 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@
 
 #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
 
+#include <linux/arch_topology.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
 #include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>
 
 #include <asm/unaligned.h>
 
@@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {
 	}
 };
 
+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+	int cpu;
+	struct irq_work irq_work;
+	struct kthread_work work;
+	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+	struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver register itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	struct cppc_perf_fb_ctrs fb_ctrs = {0};
+	struct cppc_cpudata *cpu_data;
+	unsigned long local_freq_scale;
+	u64 perf;
+
+	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+	cpu_data = cppc_fi->cpu_data;
+
+	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+		pr_warn("%s: failed to read perf counters\n", __func__);
+		return;
+	}
+
+	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+				     &fb_ctrs);
+	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+	perf <<= SCHED_CAPACITY_SHIFT;
+	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+
+	/* This can happen due to counter overflows */
+	if (unlikely(local_freq_scale > 1024))
+		local_freq_scale = 1024;
+
+	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+
+	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+	kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+	/*
+	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+	 * context.
+	 */
+	irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+	.source = SCALE_FREQ_SOURCE_CPPC,
+	.set_freq_scale = cppc_scale_freq_tick,
+};
+
+static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu, ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return 0;
+
+	for_each_cpu(cpu, policy->cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		cppc_fi->cpu = cpu;
+		cppc_fi->cpu_data = policy->driver_data;
+		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+		if (ret) {
+			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",
+				__func__, cpu, ret);
+
+			/* Avoid failing driver's probe because of FIE */
+			return 0;
+		}
+	}
+
+	/* Register for freq-invariance */
+	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);
+	return 0;
+}
+
+/*
+ * We free all the resources on policy's removal and not on CPU removal as the
+ * irq-work are per-cpu and the hotplug core takes care of flushing the pending
+ * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work
+ * fires on another CPU after the concerned CPU is removed, it won't harm.
+ *
+ * We just need to make sure to remove them all on policy->exit().
+ */
+static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	/* policy->cpus will be empty here, use related_cpus instead */
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);
+
+	for_each_cpu(cpu, policy->related_cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		irq_work_sync(&cppc_fi->irq_work);
+		kthread_cancel_work_sync(&cppc_fi->work);
+	}
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+	struct sched_attr attr = {
+		.size		= sizeof(struct sched_attr),
+		.sched_policy	= SCHED_DEADLINE,
+		.sched_nice	= 0,
+		.sched_priority	= 0,
+		/*
+		 * Fake (unused) bandwidth; workaround to "fix"
+		 * priority inheritance.
+		 */
+		.sched_runtime	= 1000000,
+		.sched_deadline = 10000000,
+		.sched_period	= 10000000,
+	};
+	int ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kworker_fie = kthread_create_worker(0, "cppc_fie");
+	if (IS_ERR(kworker_fie))
+		return;
+
+	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+			ret);
+		kthread_destroy_worker(kworker_fie);
+		return;
+	}
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kthread_destroy_worker(kworker_fie);
+	kworker_fie = NULL;
+}
+
+#else
+static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	return 0;
+}
+
+static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+}
+
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
 /* Callback function used to retrieve the max frequency from DMI */
 static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
 {
@@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
-	if (ret)
+	if (ret) {
 		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
 			 caps->highest_perf, cpu, ret);
+		return ret;
+	}
 
-	return ret;
+	return cppc_cpufreq_cpu_fie_init(policy);
 }
 
 static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
@@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 	unsigned int cpu = policy->cpu;
 	int ret;
 
+	cppc_cpufreq_cpu_fie_exit(policy);
+
 	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
@@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
 	return (u32)t1 - (u32)t0;
 }
 
-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
 {
 	u64 delta_reference, delta_delivered;
-	u64 reference_perf, delivered_perf;
+	u64 reference_perf;
 
-	reference_perf = fb_ctrs_t0.reference_perf;
+	reference_perf = fb_ctrs_t0->reference_perf;
 
-	delta_reference = get_delta(fb_ctrs_t1.reference,
-				    fb_ctrs_t0.reference);
-	delta_delivered = get_delta(fb_ctrs_t1.delivered,
-				    fb_ctrs_t0.delivered);
+	delta_reference = get_delta(fb_ctrs_t1->reference,
+				    fb_ctrs_t0->reference);
+	delta_delivered = get_delta(fb_ctrs_t1->delivered,
+				    fb_ctrs_t0->delivered);
 
-	/* Check to avoid divide-by zero */
-	if (delta_reference || delta_delivered)
-		delivered_perf = (reference_perf * delta_delivered) /
-					delta_reference;
-	else
-		delivered_perf = cpu_data->perf_ctrls.desired_perf;
+	/* Check to avoid divide-by zero and invalid delivered_perf */
+	if (!delta_reference || !delta_delivered)
+		return cpu_data->perf_ctrls.desired_perf;
+
+	return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+	u64 delivered_perf;
+
+	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+					       fb_ctrs_t1);
 
 	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
 }
@@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
 	if (ret)
 		return ret;
 
-	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
 }
 
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)
 
 static int __init cppc_cpufreq_init(void)
 {
+	int ret;
+
 	if ((acpi_disabled) || !acpi_cpc_valid())
 		return -ENODEV;
 
 	INIT_LIST_HEAD(&cpu_data_list);
 
 	cppc_check_hisi_workaround();
+	cppc_freq_invariance_init();
 
-	return cpufreq_register_driver(&cppc_cpufreq_driver);
+	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+	if (ret)
+		cppc_freq_invariance_exit();
+
+	return ret;
 }
 
 static inline void free_cpu_data(void)
@@ -531,6 +763,7 @@ static inline void free_cpu_data(void)
 static void __exit cppc_cpufreq_exit(void)
 {
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
+	cppc_freq_invariance_exit();
 
 	free_cpu_data();
 }
Ionela Voinescu June 18, 2021, 12:26 p.m. UTC | #7
Hey,

On Friday 18 Jun 2021 at 13:07:13 (+0530), Viresh Kumar wrote:
> On 17-06-21, 11:34, Ionela Voinescu wrote:
> > We can't queue any more work for it as there's no tick for the offline
> > CPU.
> 
> Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be
> a problem and we can avoid handling that here. Good.
> 
> The patch 1/3 from this series is not required anymore. I will get rid of
> stop_cpu() as well.
> 
> The second patch stays as is, as I still don't see another way of fixing the
> same problem on policy ->exit(). We still need that guarantee from topology
> core.
>

Right! That is because we need to make sure the tick does not queue any
more irq work while we do our cleanup (irq_work_sync() and then
kthread_cancel_work_sync()). Then kthread_cancel_work_sync() would
ensure the last worker wraps up so we can free the data.

I was hoping to avoid this but I don't currently see another way either.

> > P.S. I'll give more thought to the rcu use in the arch_topology driver.
> > I'm the boring kind that likes to err on the side of caution, so I tend
> > to agree that it might be good to have even if the current problem could
> > be solved in this driver.
> 
> Before I resend the series again, maybe it is better to align on the idea here
> itself first (as I need to fix some existing potential memleaks in policy
> ->init() first). So here is the new diff, looks okay now ?
> 
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index be4f62e2c5f1..3e9070f107c5 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -10,14 +10,18 @@
>  
>  #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
>  
> +#include <linux/arch_topology.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/delay.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
>  #include <linux/dmi.h>
> +#include <linux/irq_work.h>
> +#include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/vmalloc.h>
> +#include <uapi/linux/sched/types.h>
>  
>  #include <asm/unaligned.h>
>  
> @@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {
>  	}
>  };
>  
> +#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
> +
> +/* Frequency invariance support */
> +struct cppc_freq_invariance {
> +	int cpu;
> +	struct irq_work irq_work;
> +	struct kthread_work work;
> +	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
> +	struct cppc_cpudata *cpu_data;
> +};
> +
> +static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
> +static struct kthread_worker *kworker_fie;
> +
> +static struct cpufreq_driver cppc_cpufreq_driver;
> +static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
> +
> +/**
> + * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
> + * @work: The work item.
> + *
> + * The CPPC driver register itself with the topology core to provide its own
> + * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
> + * gets called by the scheduler on every tick.
> + *
> + * Note that the arch specific counters have higher priority than CPPC counters,
> + * if available, though the CPPC driver doesn't need to have any special
> + * handling for that.
> + *
> + * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
> + * reach here from hard-irq context), which then schedules a normal work item
> + * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
> + * based on the counter updates since the last tick.
> + */
> +static void cppc_scale_freq_workfn(struct kthread_work *work)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	struct cppc_perf_fb_ctrs fb_ctrs = {0};
> +	struct cppc_cpudata *cpu_data;
> +	unsigned long local_freq_scale;
> +	u64 perf;
> +
> +	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
> +	cpu_data = cppc_fi->cpu_data;
> +
> +	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
> +		pr_warn("%s: failed to read perf counters\n", __func__);
> +		return;
> +	}
> +
> +	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
> +				     &fb_ctrs);
> +	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
> +
> +	perf <<= SCHED_CAPACITY_SHIFT;
> +	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
> +
> +	/* This can happen due to counter overflows */
> +	if (unlikely(local_freq_scale > 1024))
> +		local_freq_scale = 1024;
> +
> +	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
> +}
> +
> +static void cppc_irq_work(struct irq_work *irq_work)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +
> +	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
> +	kthread_queue_work(kworker_fie, &cppc_fi->work);
> +}
> +
> +static void cppc_scale_freq_tick(void)
> +{
> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
> +
> +	/*
> +	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
> +	 * context.
> +	 */
> +	irq_work_queue(&cppc_fi->irq_work);
> +}
> +
> +static struct scale_freq_data cppc_sftd = {
> +	.source = SCALE_FREQ_SOURCE_CPPC,
> +	.set_freq_scale = cppc_scale_freq_tick,
> +};
> +
> +static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	int cpu, ret;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return 0;
> +
> +	for_each_cpu(cpu, policy->cpus) {
> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +		cppc_fi->cpu = cpu;
> +		cppc_fi->cpu_data = policy->driver_data;
> +		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
> +		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
> +
> +		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
> +		if (ret) {
> +			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",
> +				__func__, cpu, ret);
> +
> +			/* Avoid failing driver's probe because of FIE */
> +			return 0;
> +		}
> +	}
> +
> +	/* Register for freq-invariance */
> +	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);
> +	return 0;
> +}
> +
> +/*
> + * We free all the resources on policy's removal and not on CPU removal as the
> + * irq-work are per-cpu and the hotplug core takes care of flushing the pending
> + * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work
> + * fires on another CPU after the concerned CPU is removed, it won't harm.
> + *
> + * We just need to make sure to remove them all on policy->exit().
> + */
> +static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
> +{
> +	struct cppc_freq_invariance *cppc_fi;
> +	int cpu;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	/* policy->cpus will be empty here, use related_cpus instead */
> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);
> +
> +	for_each_cpu(cpu, policy->related_cpus) {
> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +		irq_work_sync(&cppc_fi->irq_work);
> +		kthread_cancel_work_sync(&cppc_fi->work);
> +	}
> +}

Yep, looks good (with its call on cppc_cpufreq_cpu_exit()).

Feel free to post V3 and I'll take an even closer look during testing.

Thanks,
Ionela.


> +
> +static void __init cppc_freq_invariance_init(void)
> +{
> +	struct sched_attr attr = {
> +		.size		= sizeof(struct sched_attr),
> +		.sched_policy	= SCHED_DEADLINE,
> +		.sched_nice	= 0,
> +		.sched_priority	= 0,
> +		/*
> +		 * Fake (unused) bandwidth; workaround to "fix"
> +		 * priority inheritance.
> +		 */
> +		.sched_runtime	= 1000000,
> +		.sched_deadline = 10000000,
> +		.sched_period	= 10000000,
> +	};
> +	int ret;
> +
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	kworker_fie = kthread_create_worker(0, "cppc_fie");
> +	if (IS_ERR(kworker_fie))
> +		return;
> +
> +	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
> +			ret);
> +		kthread_destroy_worker(kworker_fie);
> +		return;
> +	}
> +}
> +
> +static void cppc_freq_invariance_exit(void)
> +{
> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> +		return;
> +
> +	kthread_destroy_worker(kworker_fie);
> +	kworker_fie = NULL;
> +}
> +
> +#else
> +static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
> +{
> +	return 0;
> +}
> +
> +static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
> +{
> +}
> +
> +static inline void cppc_freq_invariance_init(void)
> +{
> +}
> +
> +static inline void cppc_freq_invariance_exit(void)
> +{
> +}
> +#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
> +
>  /* Callback function used to retrieve the max frequency from DMI */
>  static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
>  {
> @@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
>  
>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
> -	if (ret)
> +	if (ret) {
>  		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
>  			 caps->highest_perf, cpu, ret);
> +		return ret;
> +	}
>  
> -	return ret;
> +	return cppc_cpufreq_cpu_fie_init(policy);
>  }
>  
>  static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> @@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  	unsigned int cpu = policy->cpu;
>  	int ret;
>  
> +	cppc_cpufreq_cpu_fie_exit(policy);
> +
>  	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;
>  
>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
> @@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
>  	return (u32)t1 - (u32)t0;
>  }
>  
> -static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
> -				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
> -				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
>  {
>  	u64 delta_reference, delta_delivered;
> -	u64 reference_perf, delivered_perf;
> +	u64 reference_perf;
>  
> -	reference_perf = fb_ctrs_t0.reference_perf;
> +	reference_perf = fb_ctrs_t0->reference_perf;
>  
> -	delta_reference = get_delta(fb_ctrs_t1.reference,
> -				    fb_ctrs_t0.reference);
> -	delta_delivered = get_delta(fb_ctrs_t1.delivered,
> -				    fb_ctrs_t0.delivered);
> +	delta_reference = get_delta(fb_ctrs_t1->reference,
> +				    fb_ctrs_t0->reference);
> +	delta_delivered = get_delta(fb_ctrs_t1->delivered,
> +				    fb_ctrs_t0->delivered);
>  
> -	/* Check to avoid divide-by zero */
> -	if (delta_reference || delta_delivered)
> -		delivered_perf = (reference_perf * delta_delivered) /
> -					delta_reference;
> -	else
> -		delivered_perf = cpu_data->perf_ctrls.desired_perf;
> +	/* Check to avoid divide-by zero and invalid delivered_perf */
> +	if (!delta_reference || !delta_delivered)
> +		return cpu_data->perf_ctrls.desired_perf;
> +
> +	return (reference_perf * delta_delivered) / delta_reference;
> +}
> +
> +static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
> +{
> +	u64 delivered_perf;
> +
> +	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
> +					       fb_ctrs_t1);
>  
>  	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
>  }
> @@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>  	if (ret)
>  		return ret;
>  
> -	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
> +	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
>  }
>  
>  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
> @@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)
>  
>  static int __init cppc_cpufreq_init(void)
>  {
> +	int ret;
> +
>  	if ((acpi_disabled) || !acpi_cpc_valid())
>  		return -ENODEV;
>  
>  	INIT_LIST_HEAD(&cpu_data_list);
>  
>  	cppc_check_hisi_workaround();
> +	cppc_freq_invariance_init();
>  
> -	return cpufreq_register_driver(&cppc_cpufreq_driver);
> +	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
> +	if (ret)
> +		cppc_freq_invariance_exit();
> +
> +	return ret;
>  }
>  
>  static inline void free_cpu_data(void)
> @@ -531,6 +763,7 @@ static inline void free_cpu_data(void)
>  static void __exit cppc_cpufreq_exit(void)
>  {
>  	cpufreq_unregister_driver(&cppc_cpufreq_driver);
> +	cppc_freq_invariance_exit();
>  
>  	free_cpu_data();
>  }
> 
> -- 
> viresh