[v3,00/24] sched: Introduce classes of tasks for load balance

Message ID 20230207051105.11575-1-ricardo.neri-calderon@linux.intel.com

Message

Ricardo Neri Feb. 7, 2023, 5:10 a.m. UTC
Hi,

This is the third version of this patchset. Previous versions can be found
here [1] and here [2]. For brevity, I did not include the cover letter
from the original posting. You can read it here [1].

This patchset depends on a separate series to better handle asym_packing
between SMT cores [3].

For convenience, this patchset and [3] can be retrieved from [4] and are
based on the tip tree as of Feb 6th, 2023.

Changes since v2:

Ionela pointed out that the definition of the IPCC performance score was
vague. I provided a clearer definition and guidance on how architectures
should implement support for it.

Ionela mentioned that other architectures or scheduling schemes may want
to use IPC classes differently. I restricted their current use to
asym_packing.

Lukasz raised the issue that hardware may not be ready to support IPC
classes early after boot. I added a new interface that drivers or
enablement code can call to enable the use of IPC classes when ready.
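
The new interface is roughly of this shape (a rough sketch; the actual
code is in the updated patch 2):

DEFINE_STATIC_KEY_FALSE(sched_ipcc);

/* Drivers or enablement code call this once classification works. */
void sched_enable_ipc_classes(void)
{
	static_branch_enable(&sched_ipcc);
}

/* The load balancer checks this key before using IPCC scores. */
static inline bool sched_ipcc_enabled(void)
{
	return static_branch_unlikely(&sched_ipcc);
}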

Vincent provided multiple suggestions on how to balance non-SMT and SMT
sched groups. His feedback was incorporated in [3]. As a result, now
IPCC statistics are also used to break ties between fully_busy groups.

Dietmar indicated that neither real-time nor deadline tasks should
influence CFS load balancing. I implemented this change. Also, as per his
suggestion, I folded the IPCC statistics into the existing struct
sg_lb_stats.

Updated patches: 2, 6, 7, 10, 13, 14, 17
New patches: 9, 20
Unchanged patches: 1, 3, 4, 5, 8, 11, 12, 15, 16, 18, 19, 21, 22, 23, 24

Hopefully, this series is one step closer to being merged.

Thanks in advance for your kind feedback!

BR,
Ricardo

[1]. https://lore.kernel.org/lkml/20220909231205.14009-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20221128132100.30253-1-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/20230207045838.11243-1-ricardo.neri-calderon@linux.intel.com/
[4]. https://github.com/ricardon/tip/tree/rneri/ipc_classes_v3

Ricardo Neri (24):
  sched/task_struct: Introduce IPC classes of tasks
  sched: Add interfaces for IPC classes
  sched/core: Initialize the IPC class of a new task
  sched/core: Add user_tick as argument to scheduler_tick()
  sched/core: Update the IPC class of the current task
  sched/fair: Collect load-balancing stats for IPC classes
  sched/fair: Compute IPC class scores for load balancing
  sched/fair: Use IPCC stats to break ties between asym_packing sched
    groups
  sched/fair: Use IPCC stats to break ties between fully_busy SMT groups
  sched/fair: Use IPCC scores to select a busiest runqueue
  thermal: intel: hfi: Introduce Intel Thread Director classes
  x86/cpufeatures: Add the Intel Thread Director feature definitions
  thermal: intel: hfi: Store per-CPU IPCC scores
  thermal: intel: hfi: Update the IPC class of the current task
  thermal: intel: hfi: Report the IPC class score of a CPU
  thermal: intel: hfi: Define a default class for unclassified tasks
  thermal: intel: hfi: Enable the Intel Thread Director
  sched/task_struct: Add helpers for IPC classification
  sched/core: Initialize helpers of task classification
  sched/fair: Introduce sched_smt_siblings_idle()
  thermal: intel: hfi: Implement model-specific checks for task
    classification
  x86/cpufeatures: Add feature bit for HRESET
  x86/hreset: Configure history reset
  x86/process: Reset hardware history in context switch

 arch/x86/include/asm/cpufeatures.h       |   2 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/hreset.h            |  30 +++
 arch/x86/include/asm/msr-index.h         |   6 +-
 arch/x86/include/asm/topology.h          |   8 +
 arch/x86/kernel/cpu/common.c             |  30 ++-
 arch/x86/kernel/cpu/cpuid-deps.c         |   1 +
 arch/x86/kernel/cpu/scattered.c          |   1 +
 arch/x86/kernel/process_32.c             |   3 +
 arch/x86/kernel/process_64.c             |   3 +
 drivers/thermal/intel/intel_hfi.c        | 242 +++++++++++++++++-
 include/linux/sched.h                    |  24 +-
 include/linux/sched/topology.h           |   6 +
 init/Kconfig                             |  12 +
 kernel/sched/core.c                      |  10 +-
 kernel/sched/fair.c                      | 309 ++++++++++++++++++++++-
 kernel/sched/sched.h                     |  66 +++++
 kernel/sched/topology.c                  |   9 +
 kernel/time/timer.c                      |   2 +-
 19 files changed, 751 insertions(+), 21 deletions(-)
 create mode 100644 arch/x86/include/asm/hreset.h

Comments

Ricardo Neri March 5, 2023, 10:49 p.m. UTC | #1
On Mon, Feb 06, 2023 at 09:10:41PM -0800, Ricardo Neri wrote:
> Hi,
> 
> This is the third version of this patchset. Previous versions can be found
> here [1] and here [2]. For brevity, I did not include the cover letter
> from the original posting. You can read it here [1].

Hello! Is there any feedback for this patchset?

Thanks and BR,
Ricardo
Ricardo Neri March 28, 2023, 11:41 p.m. UTC | #2
On Mon, Mar 27, 2023 at 06:42:28PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > Use Intel Thread Director classification to update the IPC class of a
> > task. Implement the arch_update_ipcc() interface of the scheduler.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v2:
> >  * Removed the implementation of arch_has_ipc_classes().
> >
> > Changes since v1:
> >  * Adjusted the result of the classification of Intel Thread Director
> >    to start at class 1. Class 0 for the scheduler means that the task is
> >    unclassified.
> >  * Redefined union hfi_thread_feedback_char_msr to ensure all
> >    bit-fields are packed. (PeterZ)
> >  * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> >  * Shortened the names of the functions that implement IPC classes.
> >  * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> >    (PeterZ)
> > ---
> >  arch/x86/include/asm/topology.h   |  6 ++++++
> >  drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> >  2 files changed, 38 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index 458c891a8273..ffcdac3f398f 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> >  #define arch_init_invariance_cppc init_freq_invariance_cppc
> >  #endif
> >
> > +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > +void intel_hfi_update_ipcc(struct task_struct *curr);
> > +
> > +#define arch_update_ipcc intel_hfi_update_ipcc
> > +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > +
> >  #endif /* _ASM_X86_TOPOLOGY_H */
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index b06021828892..530dcf57e06e 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -72,6 +72,17 @@ union cpuid6_edx {
> >         u32 full;
> >  };
> >
> > +#ifdef CONFIG_IPC_CLASSES
> > +union hfi_thread_feedback_char_msr {
> > +       struct {
> > +               u64     classid : 8;
> > +               u64     __reserved : 55;
> > +               u64     valid : 1;
> > +       } split;
> > +       u64 full;
> > +};
> > +#endif
> > +
> >  /**
> >   * struct hfi_cpu_data - HFI capabilities per CPU
> >   * @perf_cap:          Performance capability
> > @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> >  #ifdef CONFIG_IPC_CLASSES
> >  static int __percpu *hfi_ipcc_scores;
> >
> > +void intel_hfi_update_ipcc(struct task_struct *curr)
> > +{
> > +       union hfi_thread_feedback_char_msr msr;
> > +
> > +       /* We should not be here if ITD is not supported. */
> > +       if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> > +               pr_warn_once("task classification requested but not supported!");
> > +               return;
> > +       }
> > +
> > +       rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> > +       if (!msr.split.valid)
> > +               return;
> > +
> > +       /*
> > +        * 0 is a valid classification for Intel Thread Director. A scheduler
> > +        * IPCC class of 0 means that the task is unclassified. Adjust.
> > +        */
> > +       curr->ipcc = msr.split.classid + 1;
> > +}
> 
> Wouldn't it be better to return the adjusted value from this function
> and let the caller store it where appropriate?
> 
> It doesn't look like it is necessary to pass the task_struct pointer to it.

Judging from this patch alone, yes, it does not make much sense to pass a
task_struct as argument. In patch 21, however, this function uses various
members of task_struct, which makes it more convenient to have it as an
argument, no?

> 
> > +
> >  static int alloc_hfi_ipcc_scores(void)
> >  {
> >         if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > --
Ricardo Neri March 28, 2023, 11:49 p.m. UTC | #3
On Mon, Mar 27, 2023 at 06:51:33PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > A task may be unclassified if it has been recently created, spends most
> > of its lifetime sleeping, or the hardware has not provided a
> > classification.
> >
> > Most tasks will eventually be classified as the scheduler's IPC class 1
> > (HFI class 0). This class corresponds to the capabilities in the legacy,
> > classless, HFI table.
> >
> > IPC class 1 is a reasonable choice until hardware provides an actual
> > classification. Meanwhile, the scheduler will place classes of tasks with
> > higher IPC scores on higher-performance CPUs.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> 
> Fine with me, so
> 
> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Thank you Rafael!
Ricardo Neri March 30, 2023, 2:07 a.m. UTC | #4
On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > Compute the joint total (both current and prospective) IPC class score of
> > a scheduling group and the local scheduling group.
> >
> > These IPCC statistics are used during idle load balancing. The candidate
> > scheduling group will have one fewer busy CPU after load balancing. This
> > observation is important for cores with SMT support.
> >
> > The IPCC score of scheduling groups composed of SMT siblings needs to
> > consider that the siblings share CPU resources. When computing the total
> > IPCC score of the scheduling group, divide the score of each sibling by
> > the number of busy siblings.
> >
> > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
> 
> IPCC statistics collect scores of current tasks, so they are
> meaningful only when trying to migrate one of those running tasks.
> Using such a score when pulling other tasks is just meaningless. And I
> don't see how you ensure such correct use of the IPCC score.

Thank you very much for your feedback Vincent!

It is true that the task that is current when collecting statistics may be
different from the task that is current when we are ready to pull tasks.

Using IPCC scores for load balancing benefits large, long-running tasks
the most. For these tasks, the current task is likely to remain the same
at the two mentioned points in time.

My patchset proposes to use IPCC classes to break ties between otherwise
identical sched groups in update_sd_pick_busiest(). Its use is limited to
asym_packing and fully_busy types. For these types, it is likely that there
will not be tasks wanting to run other than current. need_active_balance()
will return true and we will migrate the current task.

You are correct, by only looking at the current tasks we risk overlooking
other tasks in the queue and the statistics becoming meaningless. A fully
correct solution would need to keep track of the types of tasks in
all runqueues as they come and go. IMO, the increased complexity of such
an approach does not justify the benefit. We give the load balancer extra
information to decide between otherwise identical sched groups using the
IPCC statistics of big tasks.

> 
> > When picking a busiest group, they are used to break ties between otherwise
> > identical groups.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v2:
> >  * Also collect IPCC stats for fully_busy sched groups.
> 
> Why have you added the fully_busy case? It's worth explaining the
> rationale, because there is a good chance of using the IPCC score of
> the wrong task to make the decision.

When deciding between two fully_busy sched groups, update_sd_pick_busiest()
unconditionally selects the given candidate @sg as busiest. With IPCC
scores, what is running a scheduling group becomes important. We now have
extra information to select either of the fully_busy groups. This is done
in patch 9.

(Also note that in [1] I tweak the logic to select fully_busy SMT cores vs
fully_busy non-SMT cores).
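
To make the tie break concrete, it has roughly this shape (a sketch only;
the actual code is in patches 8 and 9, and the helper name here is
illustrative):

static bool ipcc_pick_busiest(struct sg_lb_stats *sgs,
			      struct sg_lb_stats *busiest)
{
	/* Prefer the group that benefits most from losing one busy CPU. */
	if (sgs->ipcc_score_after > busiest->ipcc_score_after)
		return true;

	/* Prospective scores tie; fall back to the scores before balancing. */
	return sgs->ipcc_score_after == busiest->ipcc_score_after &&
	       sgs->ipcc_score_before > busiest->ipcc_score_before;
}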

> 
> >  * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> >
> > Changes since v1:
> >  * Implemented cleanups and reworks from PeterZ. I took all his
> >    suggestions, except the computation of the  IPC score before and after
> >    load balancing. We are computing not the average score, but the *total*.
> >  * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> >    siblings of a physical core.
> >  * Used the new interface names.
> >  * Reworded commit message for clarity.
> > ---
> >  kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d773380a95b3..b6165aa8a376 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8901,6 +8901,8 @@ struct sg_lb_stats {
> >         unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> >         unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> >         unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > +       long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > +       unsigned long ipcc_score_before; /* IPCC score before load balancing */
> >  #endif
> >  };
> >
> > @@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> >         }
> >  }
> >
> > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > +                                     struct sched_group *sg,
> > +                                     struct lb_env *env)
> > +{
> > +       unsigned long score_on_dst_cpu, before;
> > +       int busy_cpus;
> > +       long after;
> > +
> > +       if (!sched_ipcc_enabled())
> > +               return;
> > +
> > +       /*
> > +        * IPCC scores are only useful during idle load balancing. For now,
> > +        * only asym_packing uses IPCC scores.
> > +        */
> > +       if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > +           env->idle == CPU_NOT_IDLE)
> > +               return;
> > +
> > +       /*
> > +        * IPCC scores are used to break ties only between these types of
> > +        * groups.
> > +        */
> > +       if (sgs->group_type != group_fully_busy &&
> > +           sgs->group_type != group_asym_packing)
> > +               return;
> > +
> > +       busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > +       /* No busy CPUs in the group. No tasks to move. */
> > +       if (!busy_cpus)
> > +               return;
> > +
> > +       score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > +
> > +       /*
> > +        * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > +        * and not used.
> > +        */
> > +       if (IS_ERR_VALUE(score_on_dst_cpu))
> > +               return;
> > +
> > +       before = sgs->sum_score;
> > +       after = before - sgs->min_score;
> 
> IIUC, you assume that you will select the cpu with the min score.
> How do you ensure this? Otherwise all your comparisons are useless.

This is relevant for SMT cores. A smaller idle core will help a fully_busy
SMT core by pulling low-IPC work, leaving the full core for high-IPC
work.

We ensure (or anticipate if you will) this because find_busiest_queue()
will select the queue whose current task gets the biggest IPC boost. When
done from big to small cores the IPC boost is negative.
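
Roughly, the selection looks like this (a sketch; the actual code is in
patch 10 and the helper name is illustrative):

static long ipcc_boost_of(struct rq *rq, int dst_cpu)
{
	unsigned short ipcc = rq->curr->ipcc;
	unsigned long score_dst = arch_get_ipcc_score(ipcc, dst_cpu);
	unsigned long score_src = arch_get_ipcc_score(ipcc, cpu_of(rq));

	if (IS_ERR_VALUE(score_dst) || IS_ERR_VALUE(score_src))
		return 0;

	/* Negative when moving the task from a big to a small core. */
	return (long)score_dst - (long)score_src;
}

find_busiest_queue() keeps the runqueue with the largest boost among the
otherwise comparable candidates.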

> 
> > +
> > +       /* SMT siblings share throughput. */
> > +       if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > +               before /= busy_cpus;
> > +               /* One sibling will become idle after load balance. */
> > +               after /= busy_cpus - 1;
> > +       }
> > +
> > +       sgs->ipcc_score_after = after + score_on_dst_cpu;
> > +       sgs->ipcc_score_before = before;
> 
> I'm not sure I understand why you are computing the sum_score,
> ipcc_score_before and ipcc_score_after. Why is it not sufficient to
> check if score_on_dst_cpu will be higher than the current min_score?

You are right. As the source core becomes idle after load balancing, it is
sufficient to look for the highest score_on_dst_cpu. However, we must also
handle SMT cores.

If the source sched group is a fully_busy SMT core, one of the siblings
will become idle after load balance (my patchset uses IPCC scores for
asym_packing and when balancing the number of idle CPUs from fully_busy).
The remaining non-idle siblings will benefit from the extra throughput.

We calculate ipcc_score_after to find the sched group that would benefit
the most. We calculate ipcc_score_before to break ties between two groups
with identical ipcc_score_after.
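
As a concrete illustration with made-up scores: take a fully_busy SMT core
with two busy siblings whose current tasks score 60 (the minimum) and 100.
Then sum_score = 160, before = 160 / 2 = 80 and after = (160 - 60) / (2 - 1)
= 100. If the destination CPU would run the minimum-IPCC class at a score
of 40, ipcc_score_after = 100 + 40 = 140 and ipcc_score_before = 80.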

Thanks and BR,
Ricardo

[1]. https://lore.kernel.org/lkml/20230207045838.11243-6-ricardo.neri-calderon@linux.intel.com/
Ricardo Neri March 30, 2023, 3:06 a.m. UTC | #5
On Wed, Mar 29, 2023 at 02:13:29PM +0200, Rafael J. Wysocki wrote:
> On Wed, Mar 29, 2023 at 1:31 AM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > On Mon, Mar 27, 2023 at 06:42:28PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > > >
> > > > Use Intel Thread Director classification to update the IPC class of a
> > > > task. Implement the arch_update_ipcc() interface of the scheduler.
> > > >
> > > > Cc: Ben Segall <bsegall@google.com>
> > > > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > > > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > Cc: Len Brown <len.brown@intel.com>
> > > > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > > > Cc: Mel Gorman <mgorman@suse.de>
> > > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > > Cc: x86@kernel.org
> > > > Cc: linux-pm@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > > ---
> > > > Changes since v2:
> > > >  * Removed the implementation of arch_has_ipc_classes().
> > > >
> > > > Changes since v1:
> > > >  * Adjusted the result of the classification of Intel Thread Director
> > > >    to start at class 1. Class 0 for the scheduler means that the task
> > > >    is unclassified.
> > > >  * Redefined union hfi_thread_feedback_char_msr to ensure all
> > > >    bit-fields are packed. (PeterZ)
> > > >  * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> > > >  * Shortened the names of the functions that implement IPC classes.
> > > >  * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> > > >    (PeterZ)
> > > > ---
> > > >  arch/x86/include/asm/topology.h   |  6 ++++++
> > > >  drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> > > >  2 files changed, 38 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > > > index 458c891a8273..ffcdac3f398f 100644
> > > > --- a/arch/x86/include/asm/topology.h
> > > > +++ b/arch/x86/include/asm/topology.h
> > > > @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> > > >  #define arch_init_invariance_cppc init_freq_invariance_cppc
> > > >  #endif
> > > >
> > > > +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > > > +void intel_hfi_update_ipcc(struct task_struct *curr);
> > > > +
> > > > +#define arch_update_ipcc intel_hfi_update_ipcc
> > > > +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > > > +
> > > >  #endif /* _ASM_X86_TOPOLOGY_H */
> > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > index b06021828892..530dcf57e06e 100644
> > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > @@ -72,6 +72,17 @@ union cpuid6_edx {
> > > >         u32 full;
> > > >  };
> > > >
> > > > +#ifdef CONFIG_IPC_CLASSES
> > > > +union hfi_thread_feedback_char_msr {
> > > > +       struct {
> > > > +               u64     classid : 8;
> > > > +               u64     __reserved : 55;
> > > > +               u64     valid : 1;
> > > > +       } split;
> > > > +       u64 full;
> > > > +};
> > > > +#endif
> > > > +
> > > >  /**
> > > >   * struct hfi_cpu_data - HFI capabilities per CPU
> > > >   * @perf_cap:          Performance capability
> > > > @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> > > >  #ifdef CONFIG_IPC_CLASSES
> > > >  static int __percpu *hfi_ipcc_scores;
> > > >
> > > > +void intel_hfi_update_ipcc(struct task_struct *curr)
> > > > +{
> > > > +       union hfi_thread_feedback_char_msr msr;
> > > > +
> > > > +       /* We should not be here if ITD is not supported. */
> > > > +       if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> > > > +               pr_warn_once("task classification requested but not supported!");
> > > > +               return;
> > > > +       }
> > > > +
> > > > +       rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> > > > +       if (!msr.split.valid)
> > > > +               return;
> > > > +
> > > > +       /*
> > > > +        * 0 is a valid classification for Intel Thread Director. A scheduler
> > > > +        * IPCC class of 0 means that the task is unclassified. Adjust.
> > > > +        */
> > > > +       curr->ipcc = msr.split.classid + 1;
> > > > +}
> > >
> > > Wouldn't it be better to return the adjusted value from this function
> > > and let the caller store it where appropriate?
> > >
> > > It doesn't look like it is necessary to pass the task_struct pointer to it.
> >
> > Judging from this patch alone, yes, it does not make much sense to pass a
> > task_struct as argument. In patch 21, however, this function uses various
> > members of task_struct, which makes it more convenient to have it as an
> > argument, no?
> 
> I'm not convinced about this, but anyway it is better to combine the
> two patches in such cases IMO.
> 
> The way it is done now confuses things from my perspective.

Right, I structured the patchset to have the additions to task_struct in
separate patches. This made things less clear in intel_hfi.c.

Would it be acceptable if I kept void intel_hfi_update_ipcc(struct
task_struct *curr) and added a static function u32 intel_hfi_get_ipcc(void)
to return the hardware classification?

Otherwise, I would need to add three different accessors for task_struct
so that the HFI driver can retrieve the auxiliary members from patch 21.
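
Something along these lines (a sketch only; the classid adjustment is the
one in this patch, everything else is tentative):

static u32 intel_hfi_get_ipcc(void)
{
	union hfi_thread_feedback_char_msr msr;

	rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
	if (!msr.split.valid)
		return 0;	/* 0: leave the task unclassified. */

	/* A scheduler IPCC class of 0 means unclassified. Adjust. */
	return msr.split.classid + 1;
}

void intel_hfi_update_ipcc(struct task_struct *curr)
{
	u32 ipcc = intel_hfi_get_ipcc();

	if (ipcc)
		curr->ipcc = ipcc;

	/* ... plus the updates to the auxiliary members from patch 21. */
}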

Thanks and BR,
Ricardo
Vincent Guittot March 31, 2023, 12:20 p.m. UTC | #6
On Thu, 30 Mar 2023 at 03:56, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > <ricardo.neri-calderon@linux.intel.com> wrote:
> > >
> > > Compute the joint total (both current and prospective) IPC class score of
> > > a scheduling group and the local scheduling group.
> > >
> > > These IPCC statistics are used during idle load balancing. The candidate
> > > scheduling group will have one fewer busy CPU after load balancing. This
> > > observation is important for cores with SMT support.
> > >
> > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > consider that the siblings share CPU resources. When computing the total
> > > IPCC score of the scheduling group, divide the score of each sibling by
> > > the number of busy siblings.
> > >
> > > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
> >
> > IPCC statistics collect scores of current tasks, so they are
> > meaningful only when trying to migrate one of those running tasks.
> > Using such a score when pulling other tasks is just meaningless. And I
> > don't see how you ensure such correct use of the IPCC score.
>
> Thank you very much for your feedback Vincent!
>
> > It is true that the task that is current when collecting statistics may be
> > different from the task that is current when we are ready to pull tasks.
>
> Using IPCC scores for load balancing benefits large, long-running tasks
> the most. For these tasks, the current task is likely to remain the same
> at the two mentioned points in time.

My point was mainly that the currently running task is the last one to
be pulled, and this happens only when no other task could be pulled.

>
> > My patchset proposes to use IPCC classes to break ties between otherwise
> identical sched groups in update_sd_pick_busiest(). Its use is limited to
> asym_packing and fully_busy types. For these types, it is likely that there
> will not be tasks wanting to run other than current. need_active_balance()
> will return true and we will migrate the current task.

I disagree with your assumption above; the asym_packing and fully_busy
types do not imply anything about the number of running tasks.

>
> You are correct, by only looking at the current tasks we risk overlooking
> other tasks in the queue and the statistics becoming meaningless. A fully
> > correct solution would need to keep track of the types of tasks in
> > all runqueues as they come and go. IMO, the increased complexity of such
> > an approach does not justify the benefit. We give the load balancer extra
> information to decide between otherwise identical sched groups using the
> IPCC statistics of big tasks.

Because IPCC scores are meaningful only when there is a single running
task and during active migration, you should collect them only in that
situation.

>
> >
> > > When picking a busiest group, they are used to break ties between otherwise
> > > identical groups.
> > >
> > > Cc: Ben Segall <bsegall@google.com>
> > > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > Cc: Len Brown <len.brown@intel.com>
> > > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > Cc: x86@kernel.org
> > > Cc: linux-pm@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > ---
> > > Changes since v2:
> > >  * Also collect IPCC stats for fully_busy sched groups.
> >
> > Why have you added the fully_busy case? It's worth explaining the
> > rationale, because there is a good chance of using the IPCC score of
> > the wrong task to make the decision.
>
> When deciding between two fully_busy sched groups, update_sd_pick_busiest()
> unconditionally selects the given candidate @sg as busiest. With IPCC
> scores, what is running in a scheduling group becomes important. We now have
> extra information to select either of the fully_busy groups. This is done
> in patch 9.
>
> (Also note that in [1] I tweak the logic to select fully_busy SMT cores vs
> fully_busy non-SMT cores).
>
> >
> > >  * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> > >
> > > Changes since v1:
> > >  * Implemented cleanups and reworks from PeterZ. I took all his
> > >    suggestions, except the computation of the  IPC score before and after
> > >    load balancing. We are computing not the average score, but the *total*.
> > >  * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > >    siblings of a physical core.
> > >  * Used the new interface names.
> > >  * Reworded commit message for clarity.
> > > ---
> > >  kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d773380a95b3..b6165aa8a376 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8901,6 +8901,8 @@ struct sg_lb_stats {
> > >         unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > >         unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > >         unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > > +       long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > > +       unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > >  #endif
> > >  };
> > >
> > > @@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > >         }
> > >  }
> > >
> > > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > > +                                     struct sched_group *sg,
> > > +                                     struct lb_env *env)
> > > +{
> > > +       unsigned long score_on_dst_cpu, before;
> > > +       int busy_cpus;
> > > +       long after;
> > > +
> > > +       if (!sched_ipcc_enabled())
> > > +               return;
> > > +
> > > +       /*
> > > +        * IPCC scores are only useful during idle load balancing. For now,
> > > +        * only asym_packing uses IPCC scores.
> > > +        */
> > > +       if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > > +           env->idle == CPU_NOT_IDLE)
> > > +               return;
> > > +
> > > +       /*
> > > +        * IPCC scores are used to break ties only between these types of
> > > +        * groups.
> > > +        */
> > > +       if (sgs->group_type != group_fully_busy &&
> > > +           sgs->group_type != group_asym_packing)
> > > +               return;
> > > +
> > > +       busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > +       /* No busy CPUs in the group. No tasks to move. */
> > > +       if (!busy_cpus)
> > > +               return;
> > > +
> > > +       score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > +       /*
> > > +        * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > +        * and not used.
> > > +        */
> > > +       if (IS_ERR_VALUE(score_on_dst_cpu))
> > > +               return;
> > > +
> > > +       before = sgs->sum_score;
> > > +       after = before - sgs->min_score;
> >
> > IIUC, you assume that you will select the cpu with the min score.
> > How do you ensure this? Otherwise all your comparisons are useless.
>
> This is relevant for SMT cores. A smaller idle core will help a fully_busy
> SMT core by pulling low-IPC work, leaving the full core for high-IPC
> work.
>
> We ensure (or anticipate if you will) this because find_busiest_queue()
> will select the queue whose current task gets the biggest IPC boost. When
> done from big to small cores the IPC boost is negative.
>
> >
> > > +
> > > +       /* SMT siblings share throughput. */
> > > +       if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > > +               before /= busy_cpus;
> > > +               /* One sibling will become idle after load balance. */
> > > +               after /= busy_cpus - 1;
> > > +       }
> > > +
> > > +       sgs->ipcc_score_after = after + score_on_dst_cpu;
> > > +       sgs->ipcc_score_before = before;
> >
> > I'm not sure I understand why you are computing the sum_score,
> > ipcc_score_before and ipcc_score_after. Why is it not sufficient to
> > check if score_on_dst_cpu will be higher than the current min_score?
>
> You are right. As the source core becomes idle after load balancing, it is
> sufficient to look for the highest score_on_dst_cpu. However, we must also
> handle SMT cores.
>
> If the source sched group is a fully_busy SMT core, one of the siblings
> will become idle after load balance (my patchset uses IPCC scores for
> asym_packing and when balancing the number of idle CPUs from fully_busy).
> The remaining non-idle siblings will benefit from the extra throughput.
>
> We calculate ipcc_score_after to find the sched group that would benefit
> the most. We calculate ipcc_score_before to break ties between two groups
> with identical ipcc_score_after.
>
> Thanks and BR,
> Ricardo
>
> [1]. https://lore.kernel.org/lkml/20230207045838.11243-6-ricardo.neri-calderon@linux.intel.com/
Ricardo Neri April 17, 2023, 10:52 p.m. UTC | #7
On Fri, Mar 31, 2023 at 02:20:11PM +0200, Vincent Guittot wrote:
> On Thu, 30 Mar 2023 at 03:56, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > > >
> > > > Compute the joint total (both current and prospective) IPC class score of
> > > > a scheduling group and the local scheduling group.
> > > >
> > > > These IPCC statistics are used during idle load balancing. The candidate
> > > > scheduling group will have one fewer busy CPU after load balancing. This
> > > > observation is important for cores with SMT support.
> > > >
> > > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > > consider that the siblings share CPU resources. When computing the total
> > > > IPCC score of the scheduling group, divide the score of each sibling
> > > > by the number of busy siblings.
> > > >
> > > > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
> > >
> > > IPCC statistics collect scores of current tasks, so they are
> > > meaningful only when trying to migrate one of those running tasks.
> > > Using such a score when pulling other tasks is just meaningless. And I
> > > don't see how you ensure such correct use of the IPCC score.
> >
> > Thank you very much for your feedback Vincent!
> >
> > It is true that the task that is current when collecting statistics may be
> > different from the task that is current when we are ready to pull tasks.
> >
> > Using IPCC scores for load balancing benefits large, long-running tasks
> > the most. For these tasks, the current task is likely to remain the same
> > at the two mentioned points in time.
> 
> My point was mainly that the currently running task is the last one to
> be pulled, and this happens only when no other task could be pulled.

(Thanks again for your feedback, Vincent. I am sorry for the late reply. I
needed some more time to think about it.)

Good point! It is smarter to compare and pull from the back of the queue,
rather than comparing curr and pulling from the back. We are more likely
to break the tie correctly without being too complex.

Here is an incremental patch with the update. I'll include this change in
my next version.

@@ -9281,24 +9281,42 @@ static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
 	sgs->min_score = ULONG_MAX;
 }
 
+static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+	struct task_struct *p;
+	struct rq_flags rf;
+	int ret = -EINVAL;
+
+	rq_lock_irqsave(rq, &rf);
+	if (list_empty(tasks))
+		goto out;
+
+	p = list_last_entry(tasks, struct task_struct, se.group_node);
+	if (p->flags & PF_EXITING || is_idle_task(p) ||
+	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
+		goto out;
+
+	ret = 0;
+	*ipcc = p->ipcc;
+out:
+	rq_unlock_irqrestore(rq, &rf);
+	return ret;
+}
+
 /* Called only if cpu_of(@rq) is not idle and has tasks running. */
 static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 				    struct rq *rq)
 {
-	struct task_struct *curr;
 	unsigned short ipcc;
 	unsigned long score;
 
 	if (!sched_ipcc_enabled())
 		return;
 
-	curr = rcu_dereference(rq->curr);
-	if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr) ||
-	    task_is_realtime(curr) ||
-	    !cpumask_test_cpu(dst_cpu, curr->cpus_ptr))
+	if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
 		return;
 
-	ipcc = curr->ipcc;
 	score = arch_get_ipcc_score(ipcc, cpu_of(rq));
 
> 
> >
> > My patchset proposes to use IPCC classes to break ties between otherwise
> > identical sched groups in update_sd_pick_busiest(). Its use is limited to
> > asym_packing and fully_busy types. For these types, it is likely that there
> > will not be tasks wanting to run other than current. need_active_balance()
> > will return true and we will migrate the current task.
> 
> I disagree with your assumption above; the asym_packing and fully_busy
> types do not imply anything about the number of running tasks.

Agreed. What I stated was not correct.

> 
> >
> > You are correct, by only looking at the current tasks we risk overlooking
> > other tasks in the queue and the statistics becoming meaningless. A fully
> > correct solution would need to keep track of the types of tasks in
> > all runqueues as they come and go. IMO, the increased complexity of such
> > an approach does not justify the benefit. We give the load balancer extra
> > information to decide between otherwise identical sched groups using the
> > IPCC statistics of big tasks.
> 
> Because IPCC scores are meaningful only when there is a single running
> task and during active migration, you should collect them only in that
> situation.

I think that if we compute the IPCC statistics using the tasks at the back
of the runqueue, then they remain meaningful for nr_running >= 1.