[RFC,7/7] sched: energy_model: simple cpu frequency scaling policy

Message ID 1413958051-7103-8-git-send-email-mturquette@linaro.org
State New

Commit Message

Mike Turquette Oct. 22, 2014, 6:07 a.m. UTC
Building on top of the scale invariant capacity patches and earlier
patches in this series that prepare CFS for scaling cpu frequency, this
patch implements a simple, naive ondemand-like cpu frequency scaling
policy that is driven by enqueue_task_fair and dequeue_task_fair. This
new policy is named "energy_model" as an homage to the on-going work in
that area. It is NOT an actual energy model.

This policy is implemented using the CPUfreq governor interface for two
main reasons:

1) re-using the CPUfreq machine drivers without using the governor
interface is hard. I do not foresee any issue continuing to use the
governor interface going forward but it is worth making clear what this
patch does up front.

2) using the CPUfreq interface allows us to switch between the
energy_model governor and other CPUfreq governors (such as ondemand) at
run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor
means that only one scheduler-driven policy can be linked at a time.
This limitation does not apply to "traditional" governors. I raised this
in my previous capacity_ops patches[0] but as discussed at LPC14 last
week, it seems desirable to pursue a single cpu frequency scaling policy
at first, and try to make that work for everyone interested in using it.
If that model breaks down then we can revisit the idea of dynamic
selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own
logic loop (such as a workqueue triggered by a timer), but instead uses
an event-driven design. Frequency is evaluated on entry to
{en,de}queue_task_fair, and a kthread is then woken from
run_rebalance_domains to scale cpu frequency based on the latest
evaluation.
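
The split between evaluation and scaling described above can be sketched
in user space as follows. This is only an illustration, not code from the
patch: struct em_sketch, em_init, em_eval and em_apply are hypothetical
names, time is a plain millisecond counter, and the throttle window is
assumed to start when a frequency change is applied.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define THROTTLE_MSEC 50

/* Hypothetical user-space stand-in for the per-policy governor data. */
struct em_sketch {
	atomic_long target_khz;		/* latest requested frequency */
	atomic_bool need_wake;		/* true while a request is pending */
	long cur_khz;			/* frequency currently programmed */
	long throttle_until_ms;		/* assumed: no evaluation before this */
};

static void em_init(struct em_sketch *em, long cur_khz)
{
	assert(em);
	atomic_init(&em->target_khz, cur_khz);
	atomic_init(&em->need_wake, false);
	em->cur_khz = cur_khz;
	em->throttle_until_ms = 0;
}

/*
 * Fast path, standing in for the enqueue/dequeue hook: record a request
 * and return. No frequency change happens here.
 */
static bool em_eval(struct em_sketch *em, long now_ms, long wanted_khz)
{
	if (now_ms < em->throttle_until_ms)
		return false;		/* throttled, drop this evaluation */

	atomic_store(&em->target_khz, wanted_khz);
	atomic_store(&em->need_wake, true);
	return true;
}

/*
 * Slow path, standing in for the woken kthread: apply the pending
 * request, if any, and open a new throttle window.
 */
static bool em_apply(struct em_sketch *em, long now_ms)
{
	if (!atomic_exchange(&em->need_wake, false))
		return false;		/* nothing pending */

	em->cur_khz = atomic_load(&em->target_khz);
	em->throttle_until_ms = now_ms + THROTTLE_MSEC;
	return true;
}
```

Calling em_eval() from the hot path only writes two atomics; the possibly
slow hardware transition is left to whoever calls em_apply() later.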

The policy implemented in this patch takes the highest cpu utilization
from policy->cpus and uses that to select a frequency target based on the
same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled
thresholds are pre-computed when energy_model inits. The frequency
selection is a simple comparison of cpu utilization (as defined in
Morten's latest RFC) to the threshold values. In the future this logic
could be replaced with something more sophisticated that uses PELT to
get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account
micro-architecture differences (SMT or big.LITTLE hardware), only
frequency invariance.
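
As a rough user-space illustration of the threshold logic described above
(a sketch under assumptions, not the patch itself): opp_capacity,
fill_thresholds, em_decide and enum em_decision are hypothetical names,
and SCHED_CAPACITY_SCALE is assumed to be 1024.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE	1024u	/* assumed value */
#define UP_THRESHOLD		80u
#define DOWN_THRESHOLD		20u

enum em_decision { EM_STAY, EM_GO_MAX, EM_STEP_DOWN };

/* Capacity of one operating point, scaled for frequency only. */
static unsigned int opp_capacity(unsigned int freq_khz, unsigned int max_khz)
{
	assert(max_khz > 0 && freq_khz <= max_khz);
	return (unsigned int)((unsigned long long)freq_khz *
			      SCHED_CAPACITY_SCALE / max_khz);
}

/* Pre-compute one up/down threshold pair per frequency table entry. */
static void fill_thresholds(const unsigned int *freqs_khz, size_t n,
			    unsigned int max_khz,
			    unsigned int *up, unsigned int *down)
{
	for (size_t i = 0; i < n; i++) {
		unsigned int cap = opp_capacity(freqs_khz[i], max_khz);

		up[i] = cap * UP_THRESHOLD / 100;
		down[i] = cap * DOWN_THRESHOLD / 100;
	}
}

/* Compare the busiest cpu's utilization with the current thresholds. */
static enum em_decision em_decide(unsigned int max_util,
				  unsigned int up, unsigned int down)
{
	if (max_util > up)
		return EM_GO_MAX;	/* jump straight to max frequency */
	if (max_util < down)
		return EM_STEP_DOWN;	/* go to the next lowest frequency */
	return EM_STAY;
}
```

With a table of 500 MHz, 1 GHz and 2 GHz, the 1 GHz entry has capacity 512,
so utilization above 409 requests the maximum frequency and utilization
below 102 requests a step down.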

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 drivers/cpufreq/Kconfig     |  21 +++
 include/linux/cpufreq.h     |   3 +
 kernel/sched/Makefile       |   1 +
 kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 366 insertions(+)
 create mode 100644 kernel/sched/energy_model.c

Comments

Dietmar Eggemann Oct. 27, 2014, 7:43 p.m. UTC | #1
On 22/10/14 07:07, Mike Turquette wrote:
> Building on top of the scale invariant capacity patches and earlier

We don't have scale invariant capacity yet but scale invariant
load/utilization.

> patches in this series that prepare CFS for scaling cpu frequency, this
> patch implements a simple, naive ondemand-like cpu frequency scaling
> policy that is driven by enqueue_task_fair and dequeue_task_fair. This
> new policy is named "energy_model" as an homage to the on-going work in
> that area. It is NOT an actual energy model.

Maybe it's worth mentioning that you simply take SCHED_CAPACITY_SCALE
and multiply it with the OPP frequency/max frequency of that cpu to get
the capacity at that OPP. You're not using the capacity related energy
values 'struct capacity:cap' from the energy model which would have to
be measured for the particular platform.

[...]

> The policy implemented in this patch takes the highest cpu utilization
> from policy->cpus and uses that to select a frequency target based on the
> same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled
> thresholds are pre-computed when energy_model inits. The frequency
> selection is a simple comparison of cpu utilization (as defined in
> Morten's latest RFC) to the threshold values. In the future this logic
> could be replaced with something more sophisticated that uses PELT to
> get a historical overview. Ideas are welcome.

This is what I don't grasp. The se utilization contrib and the cfs_rq
utilization are PELT signals and they provide history information? I
mean comparing the cfs_rq utilization PELT signal with a number from an
energy model, that's essentially EAS.

> 
> Note that the pre-computed thresholds above do not take into account
> micro-architecture differences (SMT or big.LITTLE hardware), only
> frequency invariance.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  drivers/cpufreq/Kconfig     |  21 +++
>  include/linux/cpufreq.h     |   3 +
>  kernel/sched/Makefile       |   1 +
>  kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 366 insertions(+)
>  create mode 100644 kernel/sched/energy_model.c
> 

[...]

> +/**
> + * em_data - per-policy data used by energy_model
> + * @throttle: bail if current time is less than ktime_throttle.
> + *                 Derived from THROTTLE_MSEC
> + * @up_threshold:   table of normalized capacity states to determine if cpu
> + *                 should run faster. Derived from UP_THRESHOLD
> + * @down_threshold: table of normalized capacity states to determine if cpu
> + *                 should run slower. Derived from DOWN_THRESHOLD
> + *
> + * struct em_data is the per-policy energy_model-specific data structure. A
> + * per-policy instance of it is created when the energy_model governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct em_data {
> +       /* per-policy throttling */
> +       ktime_t throttle;
> +       unsigned int *up_threshold;
> +       unsigned int *down_threshold;
> +       struct task_struct *task;
> +       atomic_long_t target_freq;
> +       atomic_t need_wake_task;
> +};

On my Chromebook2 (Exynos 5 Octa 5800) I end up with 2 kernel threads
(one for each cluster). There is a 'for_each_online_cpu' in
arch_scale_cpu_freq and I can see that the em data thread is invoked for
both clusters every time. Is this the intended behaviour?

It looks like you achieve the desired behaviour for freq-scaling per
cluster for this system but it's not clear to me how this is done from
the design perspective and what would have to be changed if we want to
run it on a per-cpu frequency scaling system.

Coming back to your question where you should call arch_scale_cpu_freq.
Another issue is for which cpu you should call it? For EAS we want to be
able to either raise the cpu frequency of the busiest cpu or do task
migration away from the busiest cpu. So maybe arch_scale_cpu_freq should
be called later in load_balance when we figured out which one is the
busiest cpu?
This would map nicely to load balance in MC sd level for per-cpu
frequency scaling and in DIE sd level for per-cluster frequency scaling.
But then, where do you hook in to lower the frequency eventually? And
what happens in load-balance for all the other 'sd level <-> per-foo
frequency scaling' combinations?

[...]

> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_energy_model = {
> +       .name                   = "energy_model",
> +       .governor               = energy_model_setup,
> +       .owner                  = THIS_MODULE,
> +};
> +
> +static int __init energy_model_init(void)
> +{
> +       return cpufreq_register_governor(&cpufreq_gov_energy_model);
> +}
> +

Probably not that important at this stage. I always hit

[    8.601824] ------------[ cut here ]------------
[    8.601869] WARNING: CPU: 6 PID: 3229 at
drivers/cpufreq/cpufreq_governor.c:266 cpufreq_governor_dbs+0x6f4/0x6f8()
[    8.601884] Modules linked in:
[    8.601912] CPU: 6 PID: 3229 Comm: cpufreq-set Not tainted
3.17.0-rc3-00293-g5cf54ebcaea6 #16
[    8.601953] [<c0015224>] (unwind_backtrace) from [<c0011cd4>]
(show_stack+0x18/0x1c)
[    8.601982] [<c0011cd4>] (show_stack) from [<c04c5b28>]
(dump_stack+0x80/0xc0)
[    8.602011] [<c04c5b28>] (dump_stack) from [<c0022fd8>]
(warn_slowpath_common+0x78/0x94)
[    8.602041] [<c0022fd8>] (warn_slowpath_common) from [<c00230a8>]
(warn_slowpath_null+0x24/0x2c)
[    8.602071] [<c00230a8>] (warn_slowpath_null) from [<c03a74c8>]
(cpufreq_governor_dbs+0x6f4/0x6f8)
[    8.602100] [<c03a74c8>] (cpufreq_governor_dbs) from [<c03a1b58>]
(__cpufreq_governor+0x140/0x240)
[    8.602126] [<c03a1b58>] (__cpufreq_governor) from [<c03a31b0>]
(cpufreq_set_policy+0x18c/0x20c)
[    8.602153] [<c03a31b0>] (cpufreq_set_policy) from [<c03a3400>]
(store_scaling_governor+0x78/0xa4)
[    8.602179] [<c03a3400>] (store_scaling_governor) from [<c03a149c>]
(store+0x94/0xc0)
[    8.602207] [<c03a149c>] (store) from [<c015c268>]
(kernfs_fop_write+0xc8/0x188)
[    8.602236] [<c015c268>] (kernfs_fop_write) from [<c00ffc00>]
(vfs_write+0xac/0x1b8)
[    8.602263] [<c00ffc00>] (vfs_write) from [<c010023c>]
(SyS_write+0x48/0x9c)
[    8.602290] [<c010023c>] (SyS_write) from [<c000e600>]
(ret_fast_syscall+0x0/0x30)
[    8.602307] ---[ end trace bedc9e3b94a57ef2 ]---

when I configure CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL=y during
initial system start.

[...]
Peter Zijlstra Oct. 28, 2014, 2:27 p.m. UTC | #2
On Tue, Oct 21, 2014 at 11:07:31PM -0700, Mike Turquette wrote:
> Unlike legacy CPUfreq governors, this policy does not implement its own
> logic loop (such as a workqueue triggered by a timer), but instead uses
> an event-driven design. Frequency is evaluated by entering
> {en,de}queue_task_fair and then a kthread is woken from
> run_rebalance_domains which scales cpu frequency based on the latest
> evaluation.

Also note that we probably want to extend the governor to include the
other sched classes, deadline for example is a good candidate to include
as it already explicitly provides utilization requirements from which
you can compute a hard minimum frequency, below which the task set is
unschedulable.
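
A minimal sketch of how such a hard minimum frequency could be derived,
under assumptions that are not part of this thread: each runtime is given
at the maximum frequency, execution time scales linearly with frequency,
all tasks share one cpu, and the frequency table is sorted ascending. The
names dl_task_sketch, dl_total_util and dl_min_freq_khz are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTIL_SCALE 1024u	/* fixed-point unit, 1024 == 100% of one cpu */

/* Hypothetical description of one deadline task's reservation. */
struct dl_task_sketch {
	uint64_t runtime_ns;	/* assumed to be measured at max frequency */
	uint64_t period_ns;
};

/* Sum of runtime/period in fixed point, rounding each term up. */
static uint64_t dl_total_util(const struct dl_task_sketch *t, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		assert(t[i].period_ns > 0);
		assert(t[i].runtime_ns <= t[i].period_ns);
		sum += (t[i].runtime_ns * UTIL_SCALE + t[i].period_ns - 1) /
		       t[i].period_ns;
	}
	return sum;
}

/*
 * Lowest table frequency whose relative speed covers the requested
 * utilization. Returns 0 if even the highest frequency is not enough.
 */
static unsigned int dl_min_freq_khz(uint64_t util,
				    const unsigned int *freqs_khz, size_t n)
{
	assert(n > 0);
	uint64_t max_khz = freqs_khz[n - 1];

	for (size_t i = 0; i < n; i++) {
		if ((uint64_t)freqs_khz[i] * UTIL_SCALE >= util * max_khz)
			return freqs_khz[i];
	}
	return 0;
}
```

For example, reservations of 2 ms every 10 ms and 5 ms every 20 ms add up
to 461/1024, so on a 500 MHz / 1 GHz / 2 GHz table the floor would be
1 GHz under these assumptions.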

fifo/rr are far harder to do, since for them we don't have anything
useful, the best we can do I suppose is some statistical over
provisioning but no guarantees.

Patch

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 22b42d5..78a2caa 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+	bool "energy_model"
+	select CPU_FREQ_GOV_ENERGY_MODEL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'energy_model' as default. This
+	  scales cpu frequency from the scheduler as per-task statistics
+	  are updated.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,18 @@  config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_ENERGY_MODEL
+	tristate "'energy model' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'energy_model' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu utilization. It does not
+	  evaluate utilization on a periodic basis (unlike ondemand) but
+	  instead is invoked from CFS when updating per-task statistics.
+
+	  If in doubt, say N.
+
 config CPUFREQ_GENERIC
 	tristate "Generic cpufreq driver"
 	depends on HAVE_CLK && OF
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 91d173c..69cbbec 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -482,6 +482,9 @@  extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL)
+extern struct cpufreq_governor cpufreq_gov_energy_model;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_energy_model)
 #endif
 
 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..7cd404c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@  obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c
new file mode 100644
index 0000000..5cdea9a
--- /dev/null
+++ b/kernel/sched/energy_model.c
@@ -0,0 +1,341 @@ 
+/*
+ *  Copyright (C)  2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include "sched.h"
+
+#define THROTTLE_MSEC		50
+#define UP_THRESHOLD		80
+#define DOWN_THRESHOLD		20
+
+/**
+ * em_data - per-policy data used by energy_model
+ * @throttle: bail if current time is less than ktime_throttle.
+ * 		    Derived from THROTTLE_MSEC
+ * @up_threshold:   table of normalized capacity states to determine if cpu
+ * 		    should run faster. Derived from UP_THRESHOLD
+ * @down_threshold: table of normalized capacity states to determine if cpu
+ * 		    should run slower. Derived from DOWN_THRESHOLD
+ *
+ * struct em_data is the per-policy energy_model-specific data structure. A
+ * per-policy instance of it is created when the energy_model governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct em_data {
+	/* per-policy throttling */
+	ktime_t throttle;
+	unsigned int *up_threshold;
+	unsigned int *down_threshold;
+	struct task_struct *task;
+	atomic_long_t target_freq;
+	atomic_t need_wake_task;
+};
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears all of the data structures down and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int energy_model_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	em = policy->gov_data;
+	if (!em) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	param.sched_priority = 0;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+
+
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&em->need_wake_task))  {
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq),
+				CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+					__func__, ret);
+
+		em->throttle = ktime_add_ms(ktime_get(), THROTTLE_MSEC);
+		atomic_set(&em->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
+
+static void em_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task)) {
+		return;
+	}
+
+	wake_up_process(task);
+}
+
+void arch_scale_cpu_freq(void)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy))
+			continue;
+
+		em = policy->gov_data;
+		if (!em)
+			continue;
+
+		/*
+		 * FIXME replace the atomic stuff by holding write-locks
+		 * in arch_eval_cpu_freq?
+		 */
+		if (atomic_read(&em->need_wake_task)) {
+			em_wake_up_process(em->task);
+		}
+
+		cpufreq_cpu_put(policy);
+	}
+}
+
+/**
+ * arch_eval_cpu_freq - scale cpu frequency based on CFS utilization
+ * @update_cpus: mask of CPUs with updated utilization and capacity
+ *
+ * Declared and weakly defined in kernel/sched/fair.c. This definition overrides
+ * the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may
+ * contain cpus that are not in the same policy. Otherwise update_cpus will be
+ * a single cpu.
+ *
+ * Holds read lock for policy->rwsem.
+ *
+ * FIXME weak arch function means that only one definition of this function can
+ * be linked. How to support multiple energy model policies?
+ */
+void arch_eval_cpu_freq(struct cpumask *update_cpus)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int index;
+	unsigned int cpu, tmp;
+	unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0;
+
+	/*
+	 * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset
+	 * of update_cpus. In such case take the first cpu in update_cpus, get
+	 * its policy and try to scale the affected cpus. Then we clear the
+	 * corresponding bits from update_cpus and try again. If a policy does
+	 * not exist for a cpu then we remove that bit as well, preventing an
+	 * infinite loop.
+	 */
+	while (!cpumask_empty(update_cpus)) {
+		percent_util = 0;
+		max_util = 0;
+		cap = 0;
+		util = 0;
+
+		cpu = cpumask_first(update_cpus);
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy)) {
+			cpumask_clear_cpu(cpu, update_cpus);
+			continue;
+		}
+
+		if (!policy->gov_data)
+			goto bail;	/* drop the policy reference */
+
+		em = policy->gov_data;
+
+		if (ktime_before(ktime_get(), em->throttle)) {
+			trace_printk("THROTTLED");
+			goto bail;
+		}
+
+		/*
+		 * try scaling cpus
+		 *
+		 * algorithm assumptions & description:
+		 * 	all cpus in a policy run at the same rate/capacity.
+		 * 	choose frequency target based on most utilized cpu.
+		 * 	do not care about aggregating cpu utilization.
+		 * 	do not track any historical trends beyond utilization
+		 * 	if max_util > 80% of current capacity,
+		 * 		go to max capacity
+		 * 	if max_util < 20% of current capacity,
+		 * 		go to the next lowest capacity
+		 * 	otherwise, stay at the same capacity state
+		 */
+		for_each_cpu(tmp, policy->cpus) {
+			util = usage_util_of(tmp);
+			if (util > max_util)
+				max_util = util;
+		}
+
+		cap = capacity_of(cpu);
+		if (!cap) {
+			goto bail;
+		}
+
+		index = cpufreq_frequency_table_get_index(policy, policy->cur);
+		if (max_util > em->up_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->max);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		} else if (max_util < em->down_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->cur - 1);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		}
+
+bail:
+		/* remove policy->cpus from update_cpus */
+		cpumask_andnot(update_cpus, update_cpus, policy->cpus);
+		cpufreq_cpu_put(policy);
+	}
+
+	return;
+}
+
+static void em_start(struct cpufreq_policy *policy)
+{
+	int index = 0, count = 0;
+	unsigned int capacity;
+	struct em_data *em;
+	struct cpufreq_frequency_table *pos;
+
+	/* prepare per-policy private data */
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em) {
+		pr_debug("%s: failed to allocate private data\n", __func__);
+		return;
+	}
+
+	policy->gov_data = em;
+
+	/* how many entries in the frequency table? */
+	cpufreq_for_each_entry(pos, policy->freq_table)
+		count++;
+
+	/* pre-compute thresholds */
+	em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+	em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		/* FIXME capacity below is not scaled for uarch */
+		capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
+		em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
+		em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
+		pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
+				__func__, cpumask_first(policy->cpus), index,
+				capacity, em->up_threshold[index],
+				em->down_threshold[index]);
+		index++;
+	}
+
+	/* init per-policy kthread */
+	em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");
+	if (IS_ERR_OR_NULL(em->task))
+		pr_err("%s: failed to create kenergy_model_task thread\n", __func__);
+}
+
+
+static void em_stop(struct cpufreq_policy *policy)
+{
+	struct em_data *em;
+
+	em = policy->gov_data;
+
+	kthread_stop(em->task);
+
+	/* replace with devm counterparts */
+	kfree(em->up_threshold);
+	kfree(em->down_threshold);
+	kfree(em);
+}
+
+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+	switch (event) {
+		case CPUFREQ_GOV_START:
+			/* Start managing the frequency */
+			em_start(policy);
+			return 0;
+
+		case CPUFREQ_GOV_STOP:
+			em_stop(policy);
+			return 0;
+
+		case CPUFREQ_GOV_LIMITS:	/* unused */
+		case CPUFREQ_GOV_POLICY_INIT:	/* unused */
+		case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
+			break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+static
+#endif
+struct cpufreq_governor cpufreq_gov_energy_model = {
+	.name			= "energy_model",
+	.governor		= energy_model_setup,
+	.owner			= THIS_MODULE,
+};
+
+static int __init energy_model_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}
+
+static void __exit energy_model_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_energy_model);
+}
+
+/* Try to make this the default governor */
+fs_initcall(energy_model_init);
+
+MODULE_LICENSE("GPL");