diff mbox series

[V2,3/3] cpufreq: CPPC: Add support for frequency invariance

Message ID e7e653ede3ef54acc906d2bde47a3b9a41533404.1623825725.git.viresh.kumar@linaro.org
State New
Headers show
Series cpufreq: cppc: Add support for frequency invariance | expand

Commit Message

Viresh Kumar June 16, 2021, 6:48 a.m. UTC
The Frequency Invariance Engine (FIE) is providing a frequency scaling
correction factor that helps achieve more accurate load-tracking.

Normally, this scaling factor can be obtained directly with the help of
the cpufreq drivers as they know the exact frequency the hardware is
running at. But that isn't the case for CPPC cpufreq driver.

Another way of obtaining that is using the arch specific counter
support, which is already present in kernel, but that hardware is
optional for platforms.

This patch updates the CPPC driver to register itself with the topology
core to provide its own implementation (cppc_scale_freq_tick()) of
topology_scale_freq_tick() which gets called by the scheduler on every
tick. Note that the arch specific counters have higher priority than
CPPC counters, if available, though the CPPC driver doesn't need to have
any special handling for that.

On an invocation of cppc_scale_freq_tick(), we schedule an irq work
(since we reach here from hard-irq context), which then schedules a
normal work item and cppc_scale_freq_workfn() updates the per_cpu
arch_freq_scale variable based on the counter updates since the last
tick.

To allow platforms to disable this CPPC counter-based frequency
invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
which is enabled by default.

This also exports sched_setattr_nocheck() as the CPPC driver can be
built as a module.

Cc: linux-acpi@vger.kernel.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 drivers/cpufreq/Kconfig.arm    |  10 ++
 drivers/cpufreq/cppc_cpufreq.c | 232 ++++++++++++++++++++++++++++++---
 include/linux/arch_topology.h  |   1 +
 kernel/sched/core.c            |   1 +
 4 files changed, 227 insertions(+), 17 deletions(-)

-- 
2.31.1.272.g89b43f80a514

Comments

Ionela Voinescu June 16, 2021, 12:48 p.m. UTC | #1
Hi,

I was looking forward to the complete removal of stop_cpu() :).

On Wednesday 16 Jun 2021 at 12:18:09 (+0530), Viresh Kumar wrote:
> The Frequency Invariance Engine (FIE) is providing a frequency scaling

> correction factor that helps achieve more accurate load-tracking.

> 

> Normally, this scaling factor can be obtained directly with the help of

> the cpufreq drivers as they know the exact frequency the hardware is

> running at. But that isn't the case for CPPC cpufreq driver.

> 

> Another way of obtaining that is using the arch specific counter

> support, which is already present in kernel, but that hardware is

> optional for platforms.

> 

> This patch updates the CPPC driver to register itself with the topology

> core to provide its own implementation (cppc_scale_freq_tick()) of

> topology_scale_freq_tick() which gets called by the scheduler on every

> tick. Note that the arch specific counters have higher priority than

> CPPC counters, if available, though the CPPC driver doesn't need to have

> any special handling for that.

> 

> On an invocation of cppc_scale_freq_tick(), we schedule an irq work

> (since we reach here from hard-irq context), which then schedules a

> normal work item and cppc_scale_freq_workfn() updates the per_cpu

> arch_freq_scale variable based on the counter updates since the last

> tick.

> 

> To allow platforms to disable this CPPC counter-based frequency

> invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,

> which is enabled by default.

> 

> This also exports sched_setattr_nocheck() as the CPPC driver can be

> built as a module.

> 

> Cc: linux-acpi@vger.kernel.org

> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> ---

>  drivers/cpufreq/Kconfig.arm    |  10 ++

>  drivers/cpufreq/cppc_cpufreq.c | 232 ++++++++++++++++++++++++++++++---

>  include/linux/arch_topology.h  |   1 +

>  kernel/sched/core.c            |   1 +

>  4 files changed, 227 insertions(+), 17 deletions(-)

> 

[..]
> +static void cppc_cpufreq_start_cpu(struct cpufreq_policy *policy,

> +				   unsigned int cpu)

> +{

> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);

> +	int ret;

> +

> +	cppc_fi->cpu = cpu;

> +	cppc_fi->cpu_data = policy->driver_data;

> +	kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);

> +	init_irq_work(&cppc_fi->irq_work, cppc_irq_work);

> +

> +	ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);

> +	if (ret) {

> +		pr_warn("%s: failed to read perf counters: %d\n", __func__,

> +			ret);

> +		return;

> +	}

> +

> +	/* Register for freq-invariance */

> +	topology_set_scale_freq_source(&cppc_sftd, cpumask_of(cpu));

> +}

> +

> +static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,

> +				  unsigned int cpu)

> +{

> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);

> +

> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));

> +

> +	irq_work_sync(&cppc_fi->irq_work);

> +	kthread_cancel_work_sync(&cppc_fi->work);

> +}


I'll only comment on this for now as I should know the rest.

Let's assume we don't have these, what happens now is the following:

1. We hotplug out the last CPU in a policy, we call the
   .stop_cpu()/exit() function which will free the cppc_cpudata structure.

   The only vulnerability is if we have a last tick on that last CPU,
   after the above callback was called.

2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is
   stale.

We do not have a problem when removing the CPPC cpufreq module as we're
doing cppc_freq_invariance_exit() before unregistering the driver and
freeing the data.

Are 1. and 2 the only problems we have, or have I missed any?

Thanks,
Ionela.
Ionela Voinescu June 17, 2021, 10:34 a.m. UTC | #2
Many thanks for the details!

On Thursday 17 Jun 2021 at 08:54:16 (+0530), Viresh Kumar wrote:
> On 16-06-21, 13:48, Ionela Voinescu wrote:

> > I was looking forward to the complete removal of stop_cpu() :).

> 

> No one wants to remove more code than I do :)

> 

> > I'll only comment on this for now as I should know the rest.

> > 

> > Let's assume we don't have these, what happens now is the following:

> > 

> > 1. We hotplug out the last CPU in a policy, we call the

> >    .stop_cpu()/exit() function which will free the cppc_cpudata structure.

> > 

> >    The only vulnerability is if we have a last tick on that last CPU,

> >    after the above callback was called.

> > 

> > 2. When the CPU at 1. gets hotplugged back in, the cppc_fi->cpu_data is

> >    stale.

> > 

> > We do not have a problem when removing the CPPC cpufreq module as we're

> > doing cppc_freq_invariance_exit() before unregistering the driver and

> > freeing the data.

> > 

> > Are 1. and 2 the only problems we have, or have I missed any?

> 

> There is more to it. For simplicity, lets assume a quad-core setup,

> with all 4 CPUs sharing the cpufreq policy. And here is what happens

> without the new changes:

> 

> - On CPPC cpufreq driver insertion, we start 4 kthreads/irq-works (1

>   for each CPU as it fires from tick) from the ->init() callback.

> 

> - Now lets say we offline CPU3. The CPPC cpufreq driver isn't notified

>   by anyone and it hasn't registered itself to hotplug notifier as

>   well. So, the irq-work/kthread isn't stopped. This results in the

>   issue reported by Qian earlier.

> 


I might be missing something, but when you offline a single CPU in a
policy, the worse that can happen is that a last call to
cppc_scale_freq_tick() would have sneaked in before irqs and the tick
are disabled. But even if we have a last call to
cppc_scale_freq_workfn(), the counter read methods would know how to
cope with hotplug, and the cppc_cpudata structure would still be
allocated and have valid desired_perf and highest_perf values.

Worse case, the last scale factor set for the CPU will be meaningless,
but it's already meaningless as the CPU is going down.

When you are referring to the issue reported by Qian I suppose you are
referring to this [1]. I think this is the case where you hotplug the
last CPU in a policy and free cppc_cpudata.

[1] https://lore.kernel.org/linux-pm/41f5195e-0e5f-fdfe-ba37-34e1fd8e4064@quicinc.com/

>   The very same thing happens with schedutil governor too, which uses

>   very similar mechanisms, and the cpufreq core takes care of it there

>   by stopping the governor before removing the CPU from policy->cpus

>   and starting it again. So there we stop irq-work/kthread for all 4

>   CPUs, then start them only for remaining 3.

> 


Things are different for sugov: you have to stop the governor when one
CPU in the policy goes down, and it's normal for sugov not to allow its
hooks to be called while the governor is down so it has to do a full
cleanup when going down and a full bringup when going back up.

The difference for CPPC invariance is that only a CPU can trigger the
work to update its own scale factor, through the tick. No other CPU x
can trigger a scale factor update for CPU y, but x can carry out the
work for CPU y (x can run the cppc_scale_freq_workfn(y)).

So when y goes down, it won't have a tick to queue any irq or kthread
work any longer until it's brought back up. So I believe that the only
cleanup that needs to be done when a CPU goes offline, is to ensure
that the work triggered by that last tick is done safely.

>   I thought about that approach as well, but it was too heavy to stop

>   everyone and start again in this case. And so I invented start_cpu()

>   and stop_cpu() callbacks.

> 


> - In this case, because the CPU is going away, we need to make sure we

>   don't queue any more irq-work or kthread to it and this is one of

>   the main reasons for adding synchronization in the topology core,

>   because we need a hard guarantee here that irq-work won't fire

>   again, as the CPU won't be there or will not be in a sane state.

> 


We can't queue any more work for it as there's no tick for the offline
CPU.

> - The same sequence of events is true for the case where the last CPU

>   of a policy goes away (not in this example, but lets say quad-core

>   setup with separate policies for each CPU).

> 

> - Not just the policy, but the CPU may be going away as well.

> 

> I hope I was able to clarify a bit here.

> 


Thanks! I do agree it is better to be cautious, but I initially wanted to
understand we don't see the problem bigger than it actually is.

Thanks,
Ionela.

P.S. I'll give more thought to the rcu use in the arch_topology driver.
I'm the boring kind that likes to err on the side of caution, so I tend
to agree that it might be good to have even if the current problem could
be solved in this driver.


> -- 

> viresh
Viresh Kumar June 17, 2021, 11:19 a.m. UTC | #3
On 17-06-21, 11:34, Ionela Voinescu wrote:
> I might be missing something, but when you offline a single CPU in a

> policy, the worse that can happen is that a last call to

> cppc_scale_freq_tick() would have sneaked in before irqs and the tick

> are disabled. But even if we have a last call to

> cppc_scale_freq_workfn(), the counter read methods would know how to

> cope with hotplug, and the cppc_cpudata structure would still be

> allocated and have valid desired_perf and highest_perf values.


Okay, I somehow assumed that cppc_scale_freq_workfn() needs to run on the local
CPU, while it can actually land anywhere. My fault.

But the irq-work being queued here is per-cpu and it will get queued on the
local CPU where the tick occurred.

Now I am not sure of what will happen to this irq-work in that case. I tried to
look now and I see that these irq-work items are processed first on tick and
then the tick handler of scheduler is called, so that will again queue the cppc
irq-work.

What happens if this happens along with CPU hotplug ? I am not sure I understand
that. There may or may not be any side effects of this. Lets assume the work
item is left in the queue as is and no tick happens after that as the CPU is
offlined. We are good.

Now if we unload the cpufreq driver at this moment, the driver will call
irq_work_sync(), which may end up in a while loop ? There is no
irq-work-cancel() API.

Peter: Can you help here on this ? Lemme try to explain a bit here:

We are starting an irq-work (in cppc cpufreq driver) from
scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
take care of CPU hotplug explicitly and make sure this work isn't queued again
from the next tick.

Is it important for user to make sure it gets rid of the irq-work during hotplug
here ?

> Worse case, the last scale factor set for the CPU will be meaningless,

> but it's already meaningless as the CPU is going down.

> 

> When you are referring to the issue reported by Qian I suppose you are

> referring to this [1]. I think this is the case where you hotplug the

> last CPU in a policy and free cppc_cpudata.

> 

> [1] https://lore.kernel.org/linux-pm/41f5195e-0e5f-fdfe-ba37-34e1fd8e4064@quicinc.com/


Yes, I was talking about this report only, I am not sure now if I understand
what actually happened here :)

Ionela: I have skipped replying to rest of your email, will get back to that
once we have clarification on the above first.

Thanks a lot for your reviews, always on time :)

-- 
viresh
Viresh Kumar June 18, 2021, 3:45 a.m. UTC | #4
On 17-06-21, 14:22, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 04:49:36PM +0530, Viresh Kumar wrote:
> > Peter: Can you help here on this ? Lemme try to explain a bit here:
> > 
> > We are starting an irq-work (in cppc cpufreq driver) from
> > scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
> > take care of CPU hotplug explicitly and make sure this work isn't queued again
> > from the next tick.
> > 
> > Is it important for user to make sure it gets rid of the irq-work during hotplug
> > here ?
> 
> irq-work is flushed on hotplug, see smpcfd_dying_cpu().

Thanks Peter.
Viresh Kumar June 18, 2021, 7:37 a.m. UTC | #5
On 17-06-21, 11:34, Ionela Voinescu wrote:
> We can't queue any more work for it as there's no tick for the offline

> CPU.


Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be
a problem and we can avoid handling that here. Good.

The patch 1/3 from this series is not required anymore. I will get rid of
stop_cpu() as well.

The second patch stays as is, as I still don't see another way of fixing the
same problem on policy ->exit(). We still need that guarantee from topology
core.

> P.S. I'll give more thought to the rcu use in the arch_topology driver.

> I'm the boring kind that likes to err on the side of caution, so I tend

> to agree that it might be good to have even if the current problem could

> be solved in this driver.


Before I resend the series again, maybe it is better to align on the idea here
itself first (as I need to fix some existing potential memleaks in policy
->init() first). So here is the new diff, looks okay now ?

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..3e9070f107c5 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@
 
 #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
 
+#include <linux/arch_topology.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
 #include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>
 
 #include <asm/unaligned.h>
 
@@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {
 	}
 };
 
+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+	int cpu;
+	struct irq_work irq_work;
+	struct kthread_work work;
+	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+	struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver register itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	struct cppc_perf_fb_ctrs fb_ctrs = {0};
+	struct cppc_cpudata *cpu_data;
+	unsigned long local_freq_scale;
+	u64 perf;
+
+	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+	cpu_data = cppc_fi->cpu_data;
+
+	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+		pr_warn("%s: failed to read perf counters\n", __func__);
+		return;
+	}
+
+	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+				     &fb_ctrs);
+	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+	perf <<= SCHED_CAPACITY_SHIFT;
+	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+
+	/* This can happen due to counter overflows */
+	if (unlikely(local_freq_scale > 1024))
+		local_freq_scale = 1024;
+
+	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+
+	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+	kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+	/*
+	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+	 * context.
+	 */
+	irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+	.source = SCALE_FREQ_SOURCE_CPPC,
+	.set_freq_scale = cppc_scale_freq_tick,
+};
+
+static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu, ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return 0;
+
+	for_each_cpu(cpu, policy->cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		cppc_fi->cpu = cpu;
+		cppc_fi->cpu_data = policy->driver_data;
+		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+		if (ret) {
+			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",
+				__func__, cpu, ret);
+
+			/* Avoid failing driver's probe because of FIE */
+			return 0;
+		}
+	}
+
+	/* Register for freq-invariance */
+	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);
+	return 0;
+}
+
+/*
+ * We free all the resources on policy's removal and not on CPU removal as the
+ * irq-work are per-cpu and the hotplug core takes care of flushing the pending
+ * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work
+ * fires on another CPU after the concerned CPU is removed, it won't harm.
+ *
+ * We just need to make sure to remove them all on policy->exit().
+ */
+static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	int cpu;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	/* policy->cpus will be empty here, use related_cpus instead */
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);
+
+	for_each_cpu(cpu, policy->related_cpus) {
+		cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+		irq_work_sync(&cppc_fi->irq_work);
+		kthread_cancel_work_sync(&cppc_fi->work);
+	}
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+	struct sched_attr attr = {
+		.size		= sizeof(struct sched_attr),
+		.sched_policy	= SCHED_DEADLINE,
+		.sched_nice	= 0,
+		.sched_priority	= 0,
+		/*
+		 * Fake (unused) bandwidth; workaround to "fix"
+		 * priority inheritance.
+		 */
+		.sched_runtime	= 1000000,
+		.sched_deadline = 10000000,
+		.sched_period	= 10000000,
+	};
+	int ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kworker_fie = kthread_create_worker(0, "cppc_fie");
+	if (IS_ERR(kworker_fie))
+		return;
+
+	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+			ret);
+		kthread_destroy_worker(kworker_fie);
+		return;
+	}
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kthread_destroy_worker(kworker_fie);
+	kworker_fie = NULL;
+}
+
+#else
+static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
+{
+	return 0;
+}
+
+static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
+{
+}
+
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
 /* Callback function used to retrieve the max frequency from DMI */
 static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
 {
@@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
-	if (ret)
+	if (ret) {
 		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
 			 caps->highest_perf, cpu, ret);
+		return ret;
+	}
 
-	return ret;
+	return cppc_cpufreq_cpu_fie_init(policy);
 }
 
 static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
@@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 	unsigned int cpu = policy->cpu;
 	int ret;
 
+	cppc_cpufreq_cpu_fie_exit(policy);
+
 	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;
 
 	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
@@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
 	return (u32)t1 - (u32)t0;
 }
 
-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
 {
 	u64 delta_reference, delta_delivered;
-	u64 reference_perf, delivered_perf;
+	u64 reference_perf;
 
-	reference_perf = fb_ctrs_t0.reference_perf;
+	reference_perf = fb_ctrs_t0->reference_perf;
 
-	delta_reference = get_delta(fb_ctrs_t1.reference,
-				    fb_ctrs_t0.reference);
-	delta_delivered = get_delta(fb_ctrs_t1.delivered,
-				    fb_ctrs_t0.delivered);
+	delta_reference = get_delta(fb_ctrs_t1->reference,
+				    fb_ctrs_t0->reference);
+	delta_delivered = get_delta(fb_ctrs_t1->delivered,
+				    fb_ctrs_t0->delivered);
 
-	/* Check to avoid divide-by zero */
-	if (delta_reference || delta_delivered)
-		delivered_perf = (reference_perf * delta_delivered) /
-					delta_reference;
-	else
-		delivered_perf = cpu_data->perf_ctrls.desired_perf;
+	/* Check to avoid divide-by zero and invalid delivered_perf */
+	if (!delta_reference || !delta_delivered)
+		return cpu_data->perf_ctrls.desired_perf;
+
+	return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+	u64 delivered_perf;
+
+	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+					       fb_ctrs_t1);
 
 	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
 }
@@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
 	if (ret)
 		return ret;
 
-	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
 }
 
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)
 
 static int __init cppc_cpufreq_init(void)
 {
+	int ret;
+
 	if ((acpi_disabled) || !acpi_cpc_valid())
 		return -ENODEV;
 
 	INIT_LIST_HEAD(&cpu_data_list);
 
 	cppc_check_hisi_workaround();
+	cppc_freq_invariance_init();
 
-	return cpufreq_register_driver(&cppc_cpufreq_driver);
+	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+	if (ret)
+		cppc_freq_invariance_exit();
+
+	return ret;
 }
 
 static inline void free_cpu_data(void)
@@ -531,6 +763,7 @@ static inline void free_cpu_data(void)
 static void __exit cppc_cpufreq_exit(void)
 {
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
+	cppc_freq_invariance_exit();
 
 	free_cpu_data();
 }

-- 
viresh
Ionela Voinescu June 18, 2021, 12:26 p.m. UTC | #6
Hey,

On Friday 18 Jun 2021 at 13:07:13 (+0530), Viresh Kumar wrote:
> On 17-06-21, 11:34, Ionela Voinescu wrote:

> > We can't queue any more work for it as there's no tick for the offline

> > CPU.

> 

> Okay, so as discussed yesterday and confirmed by Peter, CPU hotplug shouldn't be

> a problem and we can avoid handling that here. Good.

> 

> The patch 1/3 from this series is not required anymore. I will get rid of

> stop_cpu() as well.

> 

> The second patch stays as is, as I still don't see another way of fixing the

> same problem on policy ->exit(). We still need that guarantee from topology

> core.

>


Right! That is because we need to make sure the tick does not queue any
more irq work while we do our cleanup (irq_work_sync() and then
kthread_cancel_work_sync()). Then kthread_cancel_work_sync() would
ensure the last worker wraps up so we can free the data.

I was hoping to avoid this but I don't currently see another way either.

> > P.S. I'll give more thought to the rcu use in the arch_topology driver.

> > I'm the boring kind that likes to err on the side of caution, so I tend

> > to agree that it might be good to have even if the current problem could

> > be solved in this driver.

> 

> Before I resend the series again, maybe it is better to align on the idea here

> itself first (as I need to fix some existing potential memleaks in policy

> ->init() first). So here is the new diff, looks okay now ?

> 

> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c

> index be4f62e2c5f1..3e9070f107c5 100644

> --- a/drivers/cpufreq/cppc_cpufreq.c

> +++ b/drivers/cpufreq/cppc_cpufreq.c

> @@ -10,14 +10,18 @@

>  

>  #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt

>  

> +#include <linux/arch_topology.h>

>  #include <linux/kernel.h>

>  #include <linux/module.h>

>  #include <linux/delay.h>

>  #include <linux/cpu.h>

>  #include <linux/cpufreq.h>

>  #include <linux/dmi.h>

> +#include <linux/irq_work.h>

> +#include <linux/kthread.h>

>  #include <linux/time.h>

>  #include <linux/vmalloc.h>

> +#include <uapi/linux/sched/types.h>

>  

>  #include <asm/unaligned.h>

>  

> @@ -57,6 +61,214 @@ static struct cppc_workaround_oem_info wa_info[] = {

>  	}

>  };

>  

> +#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE

> +

> +/* Frequency invariance support */

> +struct cppc_freq_invariance {

> +	int cpu;

> +	struct irq_work irq_work;

> +	struct kthread_work work;

> +	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;

> +	struct cppc_cpudata *cpu_data;

> +};

> +

> +static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);

> +static struct kthread_worker *kworker_fie;

> +

> +static struct cpufreq_driver cppc_cpufreq_driver;

> +static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);

> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,

> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,

> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);

> +

> +/**

> + * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance

> + * @work: The work item.

> + *

> + * The CPPC driver register itself with the topology core to provide its own

> + * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which

> + * gets called by the scheduler on every tick.

> + *

> + * Note that the arch specific counters have higher priority than CPPC counters,

> + * if available, though the CPPC driver doesn't need to have any special

> + * handling for that.

> + *

> + * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we

> + * reach here from hard-irq context), which then schedules a normal work item

> + * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable

> + * based on the counter updates since the last tick.

> + */

> +static void cppc_scale_freq_workfn(struct kthread_work *work)

> +{

> +	struct cppc_freq_invariance *cppc_fi;

> +	struct cppc_perf_fb_ctrs fb_ctrs = {0};

> +	struct cppc_cpudata *cpu_data;

> +	unsigned long local_freq_scale;

> +	u64 perf;

> +

> +	cppc_fi = container_of(work, struct cppc_freq_invariance, work);

> +	cpu_data = cppc_fi->cpu_data;

> +

> +	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {

> +		pr_warn("%s: failed to read perf counters\n", __func__);

> +		return;

> +	}

> +

> +	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,

> +				     &fb_ctrs);

> +	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;

> +

> +	perf <<= SCHED_CAPACITY_SHIFT;

> +	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);

> +

> +	/* This can happen due to counter overflows */

> +	if (unlikely(local_freq_scale > 1024))

> +		local_freq_scale = 1024;

> +

> +	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;

> +}

> +

> +static void cppc_irq_work(struct irq_work *irq_work)

> +{

> +	struct cppc_freq_invariance *cppc_fi;

> +

> +	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);

> +	kthread_queue_work(kworker_fie, &cppc_fi->work);

> +}

> +

> +static void cppc_scale_freq_tick(void)

> +{

> +	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());

> +

> +	/*

> +	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right

> +	 * context.

> +	 */

> +	irq_work_queue(&cppc_fi->irq_work);

> +}

> +

> +static struct scale_freq_data cppc_sftd = {

> +	.source = SCALE_FREQ_SOURCE_CPPC,

> +	.set_freq_scale = cppc_scale_freq_tick,

> +};

> +

> +static int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)

> +{

> +	struct cppc_freq_invariance *cppc_fi;

> +	int cpu, ret;

> +

> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)

> +		return 0;

> +

> +	for_each_cpu(cpu, policy->cpus) {

> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);

> +		cppc_fi->cpu = cpu;

> +		cppc_fi->cpu_data = policy->driver_data;

> +		kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);

> +		init_irq_work(&cppc_fi->irq_work, cppc_irq_work);

> +

> +		ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);

> +		if (ret) {

> +			pr_warn("%s: failed to read perf counters for cpu:%d: %d\n",

> +				__func__, cpu, ret);

> +

> +			/* Avoid failing driver's probe because of FIE */

> +			return 0;

> +		}

> +	}

> +

> +	/* Register for freq-invariance */

> +	topology_set_scale_freq_source(&cppc_sftd, policy->cpus);

> +	return 0;

> +}

> +

> +/*

> + * We free all the resources on policy's removal and not on CPU removal as the

> + * irq-work are per-cpu and the hotplug core takes care of flushing the pending

> + * irq-works (hint: smpcfd_dying_cpu()) on CPU hotplug. Even if the kthread-work

> + * fires on another CPU after the concerned CPU is removed, it won't harm.

> + *

> + * We just need to make sure to remove them all on policy->exit().

> + */

> +static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)

> +{

> +	struct cppc_freq_invariance *cppc_fi;

> +	int cpu;

> +

> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)

> +		return;

> +

> +	/* policy->cpus will be empty here, use related_cpus instead */

> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, policy->related_cpus);

> +

> +	for_each_cpu(cpu, policy->related_cpus) {

> +		cppc_fi = &per_cpu(cppc_freq_inv, cpu);

> +		irq_work_sync(&cppc_fi->irq_work);

> +		kthread_cancel_work_sync(&cppc_fi->work);

> +	}

> +}


Yep, looks good (with its call on cppc_cpufreq_cpu_exit()).

Feel free to post V3 and I'll take an even closer look during testing.

Thanks,
Ionela.


> +

> +static void __init cppc_freq_invariance_init(void)

> +{

> +	struct sched_attr attr = {

> +		.size		= sizeof(struct sched_attr),

> +		.sched_policy	= SCHED_DEADLINE,

> +		.sched_nice	= 0,

> +		.sched_priority	= 0,

> +		/*

> +		 * Fake (unused) bandwidth; workaround to "fix"

> +		 * priority inheritance.

> +		 */

> +		.sched_runtime	= 1000000,

> +		.sched_deadline = 10000000,

> +		.sched_period	= 10000000,

> +	};

> +	int ret;

> +

> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)

> +		return;

> +

> +	kworker_fie = kthread_create_worker(0, "cppc_fie");

> +	if (IS_ERR(kworker_fie))

> +		return;

> +

> +	ret = sched_setattr_nocheck(kworker_fie->task, &attr);

> +	if (ret) {

> +		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,

> +			ret);

> +		kthread_destroy_worker(kworker_fie);

> +		return;

> +	}

> +}

> +

> +static void cppc_freq_invariance_exit(void)

> +{

> +	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)

> +		return;

> +

> +	kthread_destroy_worker(kworker_fie);

> +	kworker_fie = NULL;

> +}

> +

> +#else

> +static inline int cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)

> +{

> +	return 0;

> +}

> +

> +static inline void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)

> +{

> +}

> +

> +static inline void cppc_freq_invariance_init(void)

> +{

> +}

> +

> +static inline void cppc_freq_invariance_exit(void)

> +{

> +}

> +#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */

> +

>  /* Callback function used to retrieve the max frequency from DMI */

>  static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)

>  {

> @@ -324,11 +536,13 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)

>  	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;

>  

>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);

> -	if (ret)

> +	if (ret) {

>  		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",

>  			 caps->highest_perf, cpu, ret);

> +		return ret;

> +	}

>  

> -	return ret;

> +	return cppc_cpufreq_cpu_fie_init(policy);

>  }

>  

>  static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)

> @@ -338,6 +552,8 @@ static int cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)

>  	unsigned int cpu = policy->cpu;

>  	int ret;

>  

> +	cppc_cpufreq_cpu_fie_exit(policy);

> +

>  	cpu_data->perf_ctrls.desired_perf = caps->lowest_perf;

>  

>  	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);

> @@ -362,26 +578,35 @@ static inline u64 get_delta(u64 t1, u64 t0)

>  	return (u32)t1 - (u32)t0;

>  }

>  

> -static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,

> -				     struct cppc_perf_fb_ctrs fb_ctrs_t0,

> -				     struct cppc_perf_fb_ctrs fb_ctrs_t1)

> +static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,

> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,

> +				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)

>  {

>  	u64 delta_reference, delta_delivered;

> -	u64 reference_perf, delivered_perf;

> +	u64 reference_perf;

>  

> -	reference_perf = fb_ctrs_t0.reference_perf;

> +	reference_perf = fb_ctrs_t0->reference_perf;

>  

> -	delta_reference = get_delta(fb_ctrs_t1.reference,

> -				    fb_ctrs_t0.reference);

> -	delta_delivered = get_delta(fb_ctrs_t1.delivered,

> -				    fb_ctrs_t0.delivered);

> +	delta_reference = get_delta(fb_ctrs_t1->reference,

> +				    fb_ctrs_t0->reference);

> +	delta_delivered = get_delta(fb_ctrs_t1->delivered,

> +				    fb_ctrs_t0->delivered);

>  

> -	/* Check to avoid divide-by zero */

> -	if (delta_reference || delta_delivered)

> -		delivered_perf = (reference_perf * delta_delivered) /

> -					delta_reference;

> -	else

> -		delivered_perf = cpu_data->perf_ctrls.desired_perf;

> +	/* Check to avoid divide-by zero and invalid delivered_perf */

> +	if (!delta_reference || !delta_delivered)

> +		return cpu_data->perf_ctrls.desired_perf;

> +

> +	return (reference_perf * delta_delivered) / delta_reference;

> +}

> +

> +static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,

> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,

> +				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)

> +{

> +	u64 delivered_perf;

> +

> +	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,

> +					       fb_ctrs_t1);

>  

>  	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);

>  }

> @@ -405,7 +630,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)

>  	if (ret)

>  		return ret;

>  

> -	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);

> +	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);

>  }

>  

>  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)

> @@ -506,14 +731,21 @@ static void cppc_check_hisi_workaround(void)

>  

>  static int __init cppc_cpufreq_init(void)

>  {

> +	int ret;

> +

>  	if ((acpi_disabled) || !acpi_cpc_valid())

>  		return -ENODEV;

>  

>  	INIT_LIST_HEAD(&cpu_data_list);

>  

>  	cppc_check_hisi_workaround();

> +	cppc_freq_invariance_init();

>  

> -	return cpufreq_register_driver(&cppc_cpufreq_driver);

> +	ret = cpufreq_register_driver(&cppc_cpufreq_driver);

> +	if (ret)

> +		cppc_freq_invariance_exit();

> +

> +	return ret;

>  }

>  

>  static inline void free_cpu_data(void)

> @@ -531,6 +763,7 @@ static inline void free_cpu_data(void)

>  static void __exit cppc_cpufreq_exit(void)

>  {

>  	cpufreq_unregister_driver(&cppc_cpufreq_driver);

> +	cppc_freq_invariance_exit();

>  

>  	free_cpu_data();

>  }

> 

> -- 

> viresh
diff mbox series

Patch

diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index e65e0a43be64..a5c5f70acfc9 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -19,6 +19,16 @@  config ACPI_CPPC_CPUFREQ
 
 	  If in doubt, say N.
 
+config ACPI_CPPC_CPUFREQ_FIE
+	bool "Frequency Invariance support for CPPC cpufreq driver"
+	depends on ACPI_CPPC_CPUFREQ && GENERIC_ARCH_TOPOLOGY
+	default y
+	help
+	  This extends frequency invariance support in the CPPC cpufreq driver,
+	  by using CPPC delivered and reference performance counters.
+
+	  If in doubt, say N.
+
 config ARM_ALLWINNER_SUN50I_CPUFREQ_NVMEM
 	tristate "Allwinner nvmem based SUN50I CPUFreq driver"
 	depends on ARCH_SUNXI
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..63e4cff46f95 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@ 
 
 #define pr_fmt(fmt)	"CPPC Cpufreq:"	fmt
 
+#include <linux/arch_topology.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
 #include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>
 
 #include <asm/unaligned.h>
 
@@ -57,6 +61,183 @@  static struct cppc_workaround_oem_info wa_info[] = {
 	}
 };
 
+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+	int cpu;
+	struct irq_work irq_work;
+	struct kthread_work work;
+	struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+	struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver register itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+	struct cppc_perf_fb_ctrs fb_ctrs = {0};
+	struct cppc_cpudata *cpu_data;
+	unsigned long local_freq_scale;
+	u64 perf;
+
+	cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+	cpu_data = cppc_fi->cpu_data;
+
+	if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+		pr_warn("%s: failed to read perf counters\n", __func__);
+		return;
+	}
+
+	perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+				     &fb_ctrs);
+	cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+	perf <<= SCHED_CAPACITY_SHIFT;
+	local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+	if (WARN_ON(local_freq_scale > 1024))
+		local_freq_scale = 1024;
+
+	per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+	struct cppc_freq_invariance *cppc_fi;
+
+	cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+	kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+	/*
+	 * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+	 * context.
+	 */
+	irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+	.source = SCALE_FREQ_SOURCE_CPPC,
+	.set_freq_scale = cppc_scale_freq_tick,
+};
+
+static void cppc_cpufreq_start_cpu(struct cpufreq_policy *policy,
+				   unsigned int cpu)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+	int ret;
+
+	cppc_fi->cpu = cpu;
+	cppc_fi->cpu_data = policy->driver_data;
+	kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+	init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+	ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+	if (ret) {
+		pr_warn("%s: failed to read perf counters: %d\n", __func__,
+			ret);
+		return;
+	}
+
+	/* Register for freq-invariance */
+	topology_set_scale_freq_source(&cppc_sftd, cpumask_of(cpu));
+}
+
+static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
+				  unsigned int cpu)
+{
+	struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
+
+	irq_work_sync(&cppc_fi->irq_work);
+	kthread_cancel_work_sync(&cppc_fi->work);
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+	struct sched_attr attr = {
+		.size		= sizeof(struct sched_attr),
+		.sched_policy	= SCHED_DEADLINE,
+		.sched_nice	= 0,
+		.sched_priority	= 0,
+		/*
+		 * Fake (unused) bandwidth; workaround to "fix"
+		 * priority inheritance.
+		 */
+		.sched_runtime	= 1000000,
+		.sched_deadline = 10000000,
+		.sched_period	= 10000000,
+	};
+	int ret;
+
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kworker_fie = kthread_create_worker(0, "cppc_fie");
+	if (IS_ERR(kworker_fie))
+		return;
+
+	ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+			ret);
+		kthread_destroy_worker(kworker_fie);
+		return;
+	}
+
+	cppc_cpufreq_driver.start_cpu = cppc_cpufreq_start_cpu;
+	cppc_cpufreq_driver.stop_cpu = cppc_cpufreq_stop_cpu;
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+		return;
+
+	kthread_destroy_worker(kworker_fie);
+	kworker_fie = NULL;
+}
+
+#else
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
 /* Callback function used to retrieve the max frequency from DMI */
 static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
 {
@@ -362,26 +543,35 @@  static inline u64 get_delta(u64 t1, u64 t0)
 	return (u32)t1 - (u32)t0;
 }
 
-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t0,
-				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				 struct cppc_perf_fb_ctrs *fb_ctrs_t1)
 {
 	u64 delta_reference, delta_delivered;
-	u64 reference_perf, delivered_perf;
+	u64 reference_perf;
 
-	reference_perf = fb_ctrs_t0.reference_perf;
+	reference_perf = fb_ctrs_t0->reference_perf;
 
-	delta_reference = get_delta(fb_ctrs_t1.reference,
-				    fb_ctrs_t0.reference);
-	delta_delivered = get_delta(fb_ctrs_t1.delivered,
-				    fb_ctrs_t0.delivered);
+	delta_reference = get_delta(fb_ctrs_t1->reference,
+				    fb_ctrs_t0->reference);
+	delta_delivered = get_delta(fb_ctrs_t1->delivered,
+				    fb_ctrs_t0->delivered);
 
-	/* Check to avoid divide-by zero */
-	if (delta_reference || delta_delivered)
-		delivered_perf = (reference_perf * delta_delivered) /
-					delta_reference;
-	else
-		delivered_perf = cpu_data->perf_ctrls.desired_perf;
+	/* Check to avoid divide-by zero and invalid delivered_perf */
+	if (!delta_reference || !delta_delivered)
+		return cpu_data->perf_ctrls.desired_perf;
+
+	return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+				     struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+	u64 delivered_perf;
+
+	delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+					       fb_ctrs_t1);
 
 	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
 }
@@ -405,7 +595,7 @@  static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
 	if (ret)
 		return ret;
 
-	return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+	return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
 }
 
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +696,21 @@  static void cppc_check_hisi_workaround(void)
 
 static int __init cppc_cpufreq_init(void)
 {
+	int ret;
+
 	if ((acpi_disabled) || !acpi_cpc_valid())
 		return -ENODEV;
 
 	INIT_LIST_HEAD(&cpu_data_list);
 
 	cppc_check_hisi_workaround();
+	cppc_freq_invariance_init();
 
-	return cpufreq_register_driver(&cppc_cpufreq_driver);
+	ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+	if (ret)
+		cppc_freq_invariance_exit();
+
+	return ret;
 }
 
 static inline void free_cpu_data(void)
@@ -531,6 +728,7 @@  static inline void free_cpu_data(void)
 static void __exit cppc_cpufreq_exit(void)
 {
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
+	cppc_freq_invariance_exit();
 
 	free_cpu_data();
 }
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 11e555cfaecb..f180240dc95f 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -37,6 +37,7 @@  bool topology_scale_freq_invariant(void);
 enum scale_freq_source {
 	SCALE_FREQ_SOURCE_CPUFREQ = 0,
 	SCALE_FREQ_SOURCE_ARCH,
+	SCALE_FREQ_SOURCE_CPPC,
 };
 
 struct scale_freq_data {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..5226cc26a095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6389,6 +6389,7 @@  int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
 {
 	return __sched_setscheduler(p, attr, false, true);
 }
+EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
 
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.