
[v2,21/38] x86/cpu: Provide cpu_init/parse_topology()

Message ID 20230728120930.839913695@linutronix.de
State New
Series x86/cpu: Rework the topology evaluation

Commit Message

Thomas Gleixner July 28, 2023, 12:13 p.m. UTC
Topology evaluation is a complete disaster and impenetrable mess. It's
scattered all over the place with some vendor implementations doing early
evaluation and some not. The most horrific part is the permanent
overwriting of smp_num_siblings and __max_die_per_package, instead of
establishing them once on the boot CPU and validating the result on the
APs.

The goals are:

  - One topology evaluation entry point

  - Proper sharing of pointlessly duplicated code

  - Proper structuring of the evaluation logic and preferences.

  - Evaluating important system wide information only once on the boot CPU

  - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
    the shortcomings of leaf 0x1f evaluation.

Start to consolidate the topology evaluation code by providing the entry
points for the early boot CPU evaluation and for the final parsing on the
boot CPU and the APs.
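
For reference, the resulting call flow, as a rough sketch of the common.c
hunks below:

    early_identify_cpu()
        cpu_init_topology(c);   /* boot CPU, before the APIC is mapped or X2APIC is enabled */

    identify_cpu()
        cpu_parse_topology(c);  /* final pass on the boot CPU and on each AP */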

Move the trivial pieces into that new code:

   - The initialization of cpuinfo_x86::topo

   - The evaluation of CPUID leaf 1, which presets topo::initial_apicid

   - topo_apicid is set to topo::initial_apicid when invoked from early
     boot. When invoked for the final evaluation on the boot CPU it reads
     the actual APIC ID, which makes apic_get_initial_apicid() obsolete
     once everything is converted over.

Provide a temporary helper function topo_converted() which shields off the
not yet converted CPU vendors from invoking code which would break them.
This shielding covers all vendor CPUs which support SMP, but not the
historical pure UP ones as they only need the topology info init and
eventually the initial APIC initialization.

Provide two new members in cpuinfo_x86::topo to store the maximum number of
SMT siblings and the number of dies per package and add them to the debugfs
readout. These two members will be used to populate this information on the
boot CPU and to validate the APs against it.
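
With that in place the new debugfs file looks roughly as follows. The numbers
below are made up for illustration; the format follows dom_debug_show() and
assumes debugfs is mounted at /sys/kernel/debug:

    # cat /sys/kernel/debug/x86/topo/domains
    domain: Thread     shift: 1 dom_size:     2 max_threads:     2
    domain: Core       shift: 5 dom_size:    16 max_threads:    32
    domain: Module     shift: 5 dom_size:     1 max_threads:    32
    domain: Tile       shift: 5 dom_size:     1 max_threads:    32
    domain: Die        shift: 5 dom_size:     1 max_threads:    32
    domain: Package    shift: 5 dom_size:     1 max_threads:    32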

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/topology.h       |   19 +++
 arch/x86/kernel/cpu/Makefile          |    3 
 arch/x86/kernel/cpu/common.c          |   23 +---
 arch/x86/kernel/cpu/cpu.h             |    6 +
 arch/x86/kernel/cpu/debugfs.c         |   37 ++++++
 arch/x86/kernel/cpu/topology.h        |   32 +++++
 arch/x86/kernel/cpu/topology_common.c |  187 ++++++++++++++++++++++++++++++++++
 7 files changed, 290 insertions(+), 17 deletions(-)

Comments

Michael Kelley July 31, 2023, 4:05 a.m. UTC | #1
From: Thomas Gleixner <tglx@linutronix.de> Sent: Friday, July 28, 2023 5:13 AM
> 
> Topology evaluation is a complete disaster and impenetrable mess. It's
> scattered all over the place with some vendor implementations doing early
> evaluation and some not. The most horrific part is the permanent
> overwriting of smp_num_siblings and __max_die_per_package, instead of
> establishing them once on the boot CPU and validating the result on the
> APs.
> 
> The goals are:
> 
>   - One topology evaluation entry point
> 
>   - Proper sharing of pointlessly duplicated code
> 
>   - Proper structuring of the evaluation logic and preferences.
> 
>   - Evaluating important system wide information only once on the boot CPU
> 
>   - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
>     the shortcomings of leaf 0x1f evaluation.
> 
> Start to consolidate the topology evaluation code by providing the entry
> points for the early boot CPU evaluation and for the final parsing on the
> boot CPU and the APs.
> 
> Move the trivial pieces into that new code:
> 
>    - The initialization of cpuinfo_x86::topo
> 
>    - The evaluation of CPUID leaf 1, which presets topo::initial_apicid
> 
>    - topo_apicid is set to topo::initial_apicid when invoked from early
>      boot. When invoked for the final evaluation on the boot CPU it reads
>      the actual APIC ID,  which makes apic_get_initial_apicid() obsolete
>      once everything is converted over.
> 
> Provide a temporary helper function topo_converted() which shields off the
> not yet converted CPU vendors from invoking code which would break them.
> This shielding covers all vendor CPUs which support SMP, but not the
> historical pure UP ones as they only need the topology info init and
> eventually the initial APIC initialization.
> 
> Provide two new members in cpuinfo_x86::topo to store the maximum number of
> SMT siblings and the number of dies per package and add them to the debugfs
> readout. These two members will be used to populate this information on the
> boot CPU and to validate the APs against it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/topology.h       |   19 +++
>  arch/x86/kernel/cpu/Makefile          |    3
>  arch/x86/kernel/cpu/common.c          |   23 +---
>  arch/x86/kernel/cpu/cpu.h             |    6 +
>  arch/x86/kernel/cpu/debugfs.c         |   37 ++++++
>  arch/x86/kernel/cpu/topology.h        |   32 +++++
>  arch/x86/kernel/cpu/topology_common.c |  187 ++++++++++++++++++++++++++++++++++
>  7 files changed, 290 insertions(+), 17 deletions(-)
> 

[snip]

> +
> +static void parse_topology(struct topo_scan *tscan, bool early)
> +{
> +	const struct cpuinfo_topology topo_defaults = {
> +		.cu_id			= 0xff,
> +		.llc_id			= BAD_APICID,
> +		.l2c_id			= BAD_APICID,
> +	};
> +	struct cpuinfo_x86 *c = tscan->c;
> +	struct {
> +		u32	unused0		: 16,
> +			nproc		:  8,
> +			apicid		:  8;
> +	} ebx;
> +
> +	c->topo = topo_defaults;
> +
> +	if (fake_topology(tscan))
> +		return;
> +
> +	/* Preset Initial APIC ID from CPUID leaf 1 */
> +	cpuid_leaf_reg(1, CPUID_EBX, &ebx);
> +	c->topo.initial_apicid = ebx.apicid;
> +
> +	/*
> +	 * The initial invocation from early_identify_cpu() happens before
> +	 * the APIC is mapped or X2APIC enabled. For establishing the
> +	 * topology, that's not required. Use the initial APIC ID.
> +	 */
> +	if (early)
> +		c->topo.apicid = c->topo.initial_apicid;
> +	else
> +		c->topo.apicid = read_apic_id();

Using the value from the local APIC ID reg turns out to cause a problem in
some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
are sequential starting with 0.  But then there is a gap before starting the
APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
repeated if there are additional L3 caches.

The CPUID instruction executed on a guest vCPU correctly reports the APIC
IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
gaps, and the guest firmware sets the APIC_ID register for each local APIC
to match the MADT.  When parse_topology() sets the apicid field based on
reading the local APIC ID register, the value it sets is different from the
initial_apicid value for CPUs in the span of the 2nd and subsequent L3
caches, because there's no gap in the APIC IDs read from the local APIC.
Linux boots and runs, but the topology is set up with the wrong span for
the L3 cache and for the associated scheduling domains. 
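
To make the mismatch concrete (the numbers here are made up purely for
illustration): in a VM with two NUMA nodes and 12 vCPUs per L3, CPUID might
report APIC IDs 0-11 for the first L3 and 16-27 for the second, presumably
because the per-L3 APIC ID space is rounded up to a power of two, while the
MADT and the local APIC ID registers report 0-23 with no gap.  Every CPU in
the second span then ends up with topo.apicid != topo.initial_apicid, which
is where the wrong L3 span comes from.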

The old code derives the apicid from the initial_apicid via the
phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
wrong value in the local APIC ID register and MADT does not affect
anything, except that the check in validate_apic_and_package_id() fails
during boot, and a set of "Firmware bug:" messages is correctly output.
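
For reference, the old derivation, which this patch also keeps for the not
yet converted vendors (see the identify_cpu() hunk), is essentially:

	c->topo.apicid = apic->phys_pkg_id(c->topo.initial_apicid, 0);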

Three thoughts:

1)  Are Hyper-V VMs the only place where the local APIC ID register might
have a bogus value?  Probably so, but you never know what might crawl out.

2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
with the Hyper-V team about 5 years ago.  Some cases of the problem were
fixed, but some cases remain unfixed.  It's a long story.

3)  Since Hyper-V code in Linux already has an override for the apic->read()
function, it's possible to do a hack in that override so that apicid gets set to
the same value as initial_apicid, which matches the old code.  Here's the diff:

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 72d9931da3a2..2e7b18557186 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -58,6 +58,8 @@ static u32 hv_apic_read(u32 reg)
        u32 reg_val, hi;

        switch (reg) {
+       case APIC_ID:
+               return __this_cpu_read(cpu_info.topo.initial_apicid) << 24;
        case APIC_EOI:
                rdmsr(HV_X64_MSR_EOI, reg_val, hi);
                (void)hi;
@@ -311,6 +313,7 @@ void __init hv_apic_init(void)
                 * both xapic and x2apic because the field layout is the same.
                 */
                apic_update_callback(eoi, hv_apic_eoi_write);
+               apic->apic_id_registered = NULL;
                if (!x2apic_enabled()) {
                        apic_update_callback(read, hv_apic_read);
                        apic_update_callback(write, hv_apic_write);

Setting apic->apic_id_registered to NULL is necessary because it does
read_apic_id() and checks that the value matches an APIC ID that was
registered when the MADT was parsed.  This test fails for some vCPUs
in the VM because the APIC IDs from the MADT are also sequential
with no gaps as mentioned above. I don't see any big hazard in
bypassing the check.

The hv_apic_read() override is used only in VMs with an xapic.
I still need to check a few things, but I believe Hyper-V gets
MADT and local APIC ID reg numbering correct when an x2apic
is used, so I don't think any hacks are needed for that path.

Does anyone have suggestions on a different way to handle
this that's better than the above diff?  Other thoughts?

Michael
Thomas Gleixner July 31, 2023, 12:34 p.m. UTC | #2
On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
>> +	/*
>> +	 * The initial invocation from early_identify_cpu() happens before
>> +	 * the APIC is mapped or X2APIC enabled. For establishing the
>> +	 * topology, that's not required. Use the initial APIC ID.
>> +	 */
>> +	if (early)
>> +		c->topo.apicid = c->topo.initial_apicid;
>> +	else
>> +		c->topo.apicid = read_apic_id();
>
> Using the value from the local APIC ID reg turns out to cause a problem in
> some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
> due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> are sequential starting with 0.  But then there is a gap before starting the
> APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
> repeated if there are additional L3 caches.
>
> The CPUID instruction executed on a guest vCPU correctly reports the APIC
> IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
> gaps, and the guest firmware sets the APIC_ID register for each local APIC
> to match the MADT.  When parse_topology() sets the apicid field based on
> reading the local APIC ID register, the value it sets is different from the
> initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> caches, because there's no gap in the APIC IDs read from the local APIC.
> Linux boots and runs, but the topology is set up with the wrong span for
> the L3 cache and for the associated scheduling domains.

TBH. That's an insanity. MADT and the actual APIC ID determine the
topology. So the gaps should be reflected in MADT and the actual APIC
IDs should be set correctly if the intent is to provide topology
information.

Just for the record. This hack works only on Intel today, because AMD
init sets topo.apicid = read_apic_id() unconditionally. So this is
inconsistent already, no?

> The old code derives the apicid from the initial_apicid via the
> phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
> wrong value in the local APIC ID register and MADT does not affect
> anything, except that the check in validate_apic_and_package_id() fails
> during boot, and a set of "Firmware bug:" messages is correctly
> output.

So instead of fixing the firmware bugs, hyper-v just moves on and
pretends that everything works fine, right?

> Three thoughts:
>
> 1)  Are Hyper-V VMs the only place where the local APIC ID register might
> have a bogus value?  Probably so, but you never know what might crawl
> out.

Define bogus. MADT is the primary source of information because that's
how we know how many CPUs (APICs) are there and what their APIC ID is
which we can use to wake them up. So there is a reasonable expectation
that this information is consistent with the rest of the system.

The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
Processors in an MP System:

  "After the BIOS has completed the MP initialization protocol, each
   logical processor can be uniquely identified by its local APIC
   ID. Software can access these APIC IDs in either of the following
   ways:"

These ways include read from APIC, read MADT, read CPUID and implies
that this must be consistent. For X2APIC it's actually written out:

  "If the local APIC unit supports x2APIC and is operating in x2APIC
   mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
   read the processor’s x2APIC ID register. This method is equivalent to
   executing CPUID leaf 0BH described below."

AMD has not been following that in the early 64bit systems as they moved
the APIC ID space to start at 32 for the first CPU in the first socket
for whatever reasons. But since then the kernel reads back the APIC ID
on AMD systems into topo.apicid. But that was long ago and can easily be
dealt with because at least the real APIC ID and the MADT/MPTABLE
entries are consistent.

Hypervisors have their own CPUID space to override functionality with
their own magic stuff, but imposing their nutbolt ideas on the
architectural part of the system is not only wrong, it's disrespectful
against the OS developers who try to keep their system sane.

> 2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
> with the Hyper-V team about 5 years ago.  Some cases of the problem were
> fixed, but some cases remain unfixed.  It's a long story.
>
> 3)  Since Hyper-V code in Linux already has an override for the apic->read()
> function, it's possible to do a hack in that override so that apicid gets set to
> the same value as initial_apicid, which matches the old code.  Here's the diff:

This collides massively with the other work I'm doing, which uses the
MADT provided information to actually evaluate various topology related
things upfront and later during bringup. That's badly needed because lots
of today's infrastructure is based on heuristics and guesswork.

But it seems I wasted a month on reworking all of this just to be
stopped cold in the tracks by completely undocumented and unnecessary
hyper-v abuse.

So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
for topology information related to L3, then hyper-v should override the
cache topology mechanism and not impose this insanity on the basic
topology evaluation infrastructure.

Yours seriously grumpy

      tglx
Peter Zijlstra July 31, 2023, 1:27 p.m. UTC | #3
On Mon, Jul 31, 2023 at 02:34:39PM +0200, Thomas Gleixner wrote:

> This collides massively with the other work I'm doing, which uses the
> MADT provided information to actually evaluate various topology related
> things upfront and later during bringup. That's badly needed because lots
> of today's infrastructure is based on heuristics and guesswork.
> 
> But it seems I wasted a month on reworking all of this just to be
> stopped cold in the tracks by completely undocumented and unnecessary
> hyper-v abuse.
> 
> So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
> for topology information related to L3, then hyper-v should override the
> cache topology mechanism and not impose this insanity on the basic
> topology evaluation infrastructure.

So I'm very tempted to suggest you continue with the topology rewrite
and let Hyper-V keep the pieces. They're very clearly violating the SDM.

Things as they stand are untenable; the whole topology thing as it exists
today is an untenable shitshow.

Michael, is there anything you can do early (as in MADT parse early) to
fix up the APIC-IDs?
Michael Kelley July 31, 2023, 4:25 p.m. UTC | #4
From: Thomas Gleixner <tglx@linutronix.de> Sent: Monday, July 31, 2023 5:35 AM
> 
> On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
> >> +	/*
> >> +	 * The initial invocation from early_identify_cpu() happens before
> >> +	 * the APIC is mapped or X2APIC enabled. For establishing the
> >> +	 * topology, that's not required. Use the initial APIC ID.
> >> +	 */
> >> +	if (early)
> >> +		c->topo.apicid = c->topo.initial_apicid;
> >> +	else
> >> +		c->topo.apicid = read_apic_id();
> >
> > Using the value from the local APIC ID reg turns out to cause a problem in
> > some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
> > due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> > is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> > are sequential starting with 0.  But then there is a gap before starting the
> > APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
> > repeated if there are additional L3 caches.
> >
> > The CPUID instruction executed on a guest vCPU correctly reports the APIC
> > IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
> > gaps, and the guest firmware sets the APIC_ID register for each local APIC
> > to match the MADT.  When parse_topology() sets the apicid field based on
> > reading the local APIC ID register, the value it sets is different from the
> > initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> > caches, because there's no gap in the APIC IDs read from the local APIC.
> > Linux boots and runs, but the topology is set up with the wrong span for
> > the L3 cache and for the associated scheduling domains.
> 
> TBH. That's an insanity. MADT and the actual APIC ID determine the
> topology. So the gaps should be reflected in MADT and the actual APIC
> IDs should be set correctly if the intent is to provide topology
> information.
> 
> Just for the record. This hack works only on Intel today, because AMD
> init sets topo.apicid = read_apic_id() unconditionally. So this is
> inconsistent already, no?
> 

Correct.  But given that the L3 cache span in the AMD Zen1 and Zen2
processors is only 8 CPUs, there's much less reason to configure a VM
that only uses some of the CPUs in an L3 cache span.  Hyper-V does
the APIC ID numbering correctly for Zen3 with its 16 CPUs in the L3
cache span.

> > The old code derives the apicid from the initial_apicid via the
> > phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
> > wrong value in the local APIC ID register and MADT does not affect
> > anything, except that the check in validate_apic_and_package_id() fails
> > during boot, and a set of "Firmware bug:" messages is correctly
> > output.
> 
> So instead of fixing the firmware bugs, hyper-v just moves on and
> pretends that everything works fine, right?

What can I say. :-(

> 
> > Three thoughts:
> >
> > 1)  Are Hyper-V VMs the only place where the local APIC ID register might
> > have a bogus value?  Probably so, but you never know what might crawl
> > out.
> 
> Define bogus. MADT is the primary source of information because that's
> how we know how many CPUs (APICs) are there and what their APIC ID is
> which we can use to wake them up. So there is a reasonable expectation
> that this information is consistent with the rest of the system.

Commit d49597fd3bc7 "x86/cpu: Deal with broken firmware (VMWare/Xen)"
mentions VMware and XEN implementations that violate the spec.  The
commit is from late 2016.  Have these bad systems aged out and no longer
need accommodation?

> 
> The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
> Processors in an MP System:
> 
>   "After the BIOS has completed the MP initialization protocol, each
>    logical processor can be uniquely identified by its local APIC
>    ID. Software can access these APIC IDs in either of the following
>    ways:"
> 
> These ways include read from APIC, read MADT, read CPUID and implies
> that this must be consistent. For X2APIC it's actually written out:
> 
>   "If the local APIC unit supports x2APIC and is operating in x2APIC
>    mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
>    read the processor’s x2APIC ID register. This method is equivalent to
>    executing CPUID leaf 0BH described below."
> 
> AMD has not been following that in the early 64bit systems as they moved
> the APIC ID space to start at 32 for the first CPU in the first socket
> for whatever reasons. But since then the kernel reads back the APIC ID
> on AMD systems into topo.apicid. But that was long ago and can easily be
> dealt with because at least the real APIC ID and the MADT/MPTABLE
> entries are consistent.
> 
> Hypervisors have their own CPUID space to override functionality with
> their own magic stuff, but imposing their nutbolt ideas on the
> architectural part of the system is not only wrong, it's disrespectful
> against the OS developers who try to keep their system sane.
> 
> > 2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
> > with the Hyper-V team about 5 years ago.  Some cases of the problem were
> > fixed, but some cases remain unfixed.  It's a long story.
> >
> > 3)  Since Hyper-V code in Linux already has an override for the apic->read()
> > function, it's possible to do a hack in that override so that apicid gets set to
> > the same value as initial_apicid, which matches the old code.  Here's the diff:
> 
> This collides massively with the other work I'm doing, which uses the
> MADT provided information to actually evaluate various topology related
> things upfront and later during bringup. That's badly needed because lots
> of today's infrastructure is based on heuristics and guesswork.

Fair enough.  And I've re-raised the issue with the Hyper-V team.

> 
> But it seems I wasted a month on reworking all of this just to be
> stopped cold in the tracks by completely undocumented and unnecessary
> hyper-v abuse.
> 
> So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
> for topology information related to L3, then hyper-v should override the
> cache topology mechanism and not impose this insanity on the basic
> topology evaluation infrastructure.
> 
> Yours seriously grumpy
> 
>       tglx
Thomas Gleixner July 31, 2023, 8:41 p.m. UTC | #5
On Mon, Jul 31 2023 at 16:25, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Monday, July 31, 2023 5:35 AM
>> Define bogus. MADT is the primary source of information because that's
>> how we know how many CPUs (APICs) are there and what their APIC ID is
>> which we can use to wake them up. So there is a reasonable expectation
>> that this information is consistent with the rest of the system.
>
> Commit d49597fd3bc7 "x86/cpu: Deal with broken firmware (VMWare/Xen)"
> mentions VMware and XEN implementations that violate the spec.  The
> commit is from late 2016.  Have these bad systems aged out and no longer
> need accommodation?

They do, but this commit explicitly uses the MADT/real APIC ID value:

     c->initial_apicid = apicid;

So the new mechanics are accommodating those, right?

Thanks,

        tglx
Gautham R. Shenoy Aug. 1, 2023, 7:05 a.m. UTC | #6
Hello Thomas,

On Fri, Jul 28, 2023 at 02:13:08PM +0200, Thomas Gleixner wrote:
> Topology evaluation is a complete disaster and impenetrable mess. It's
> scattered all over the place with some vendor implementations doing early
> evaluation and some not. The most horrific part is the permanent
> overwriting of smp_num_siblings and __max_die_per_package, instead of
> establishing them once on the boot CPU and validating the result on the
> APs.
> 
> The goals are:
> 
>   - One topology evaluation entry point
> 
>   - Proper sharing of pointlessly duplicated code
> 
>   - Proper structuring of the evaluation logic and preferences.
> 
>   - Evaluating important system wide information only once on the boot CPU
> 
>   - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
>     the shortcomings of leaf 0x1f evaluation.
> 
> Start to consolidate the topology evaluation code by providing the entry
> points for the early boot CPU evaluation and for the final parsing on the
> boot CPU and the APs.
> 
> Move the trivial pieces into that new code:
> 
>    - The initialization of cpuinfo_x86::topo
> 
>    - The evaluation of CPUID leaf 1, which presets topo::initial_apicid
> 
>    - topo_apicid is set to topo::initial_apicid when invoked from early
>      boot. When invoked for the final evaluation on the boot CPU it reads
>      the actual APIC ID, which makes apic_get_initial_apicid() obsolete
>      once everything is converted over.
> 
> Provide a temporary helper function topo_converted() which shields off the
> not yet converted CPU vendors from invoking code which would break them.
> This shielding covers all vendor CPUs which support SMP, but not the
> historical pure UP ones as they only need the topology info init and
> eventually the initial APIC initialization.
> 
> Provide two new members in cpuinfo_x86::topo to store the maximum number of
> SMT siblings and the number of dies per package and add them to the debugfs
> readout. These two members will be used to populate this information on the
> boot CPU and to validate the APs against it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/topology.h       |   19 +++
>  arch/x86/kernel/cpu/Makefile          |    3 
>  arch/x86/kernel/cpu/common.c          |   23 +---
>  arch/x86/kernel/cpu/cpu.h             |    6 +
>  arch/x86/kernel/cpu/debugfs.c         |   37 ++++++
>  arch/x86/kernel/cpu/topology.h        |   32 +++++
>  arch/x86/kernel/cpu/topology_common.c |  187 ++++++++++++++++++++++++++++++++++
>  7 files changed, 290 insertions(+), 17 deletions(-)
> 
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -102,6 +102,25 @@ static inline void setup_node_to_cpumask
>  
>  #include <asm-generic/topology.h>
>  
> +/* Topology information */
> +enum x86_topology_domains {
> +	TOPO_SMT_DOMAIN,
> +	TOPO_CORE_DOMAIN,
> +	TOPO_MODULE_DOMAIN,
> +	TOPO_TILE_DOMAIN,
> +	TOPO_DIE_DOMAIN,
> +	TOPO_PKG_DOMAIN,
> +	TOPO_ROOT_DOMAIN,
> +	TOPO_MAX_DOMAIN,
> +};
> +

[..snip..]

> +static void topo_set_ids(struct topo_scan *tscan)
> +{
> +	struct cpuinfo_x86 *c = tscan->c;
> +	u32 apicid = c->topo.apicid;
> +
> +	c->topo.pkg_id = topo_shift_apicid(apicid, TOPO_ROOT_DOMAIN);

Shouldn't this use TOPO_PKG_DOMAIN instead of TOPO_ROOT_DOMAIN ?


> +	c->topo.die_id = topo_shift_apicid(apicid, TOPO_DIE_DOMAIN);
> +
> +	/* Relative core ID */
> +	c->topo.core_id = topo_relative_domain_id(apicid, TOPO_CORE_DOMAIN);
> +}
> +

--
Thanks and Regards
gautham.
Thomas Gleixner Aug. 1, 2023, 7:34 a.m. UTC | #7
On Tue, Aug 01 2023 at 12:35, Gautham R. Shenoy wrote:
> On Fri, Jul 28, 2023 at 02:13:08PM +0200, Thomas Gleixner wrote:
>> +static void topo_set_ids(struct topo_scan *tscan)
>> +{
>> +	struct cpuinfo_x86 *c = tscan->c;
>> +	u32 apicid = c->topo.apicid;
>> +
>> +	c->topo.pkg_id = topo_shift_apicid(apicid, TOPO_ROOT_DOMAIN);
>
> Shouldn't this use TOPO_PKG_DOMAIN instead of TOPO_ROOT_DOMAIN ?

Yup. It does not make a difference in that case. That's why I didn't
notice, but let me fix this for consistency's sake.
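
(For the record why it comes out the same here: topo_shift_apicid(apicid, dom)
shifts by dom_shifts[dom - 1], so TOPO_ROOT_DOMAIN uses the PKG shift and
TOPO_PKG_DOMAIN uses the DIE shift. As long as nothing has set a separate DIE
shift, topology_set_dom() propagates one and the same shift to DIE, PKG and
ROOT, so both variants yield the same pkg_id.)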

Patch

--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -102,6 +102,25 @@  static inline void setup_node_to_cpumask
 
 #include <asm-generic/topology.h>
 
+/* Topology information */
+enum x86_topology_domains {
+	TOPO_SMT_DOMAIN,
+	TOPO_CORE_DOMAIN,
+	TOPO_MODULE_DOMAIN,
+	TOPO_TILE_DOMAIN,
+	TOPO_DIE_DOMAIN,
+	TOPO_PKG_DOMAIN,
+	TOPO_ROOT_DOMAIN,
+	TOPO_MAX_DOMAIN,
+};
+
+struct x86_topology_system {
+	unsigned int	dom_shifts[TOPO_MAX_DOMAIN];
+	unsigned int	dom_size[TOPO_MAX_DOMAIN];
+};
+
+extern struct x86_topology_system x86_topo_system;
+
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
 extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -17,7 +17,8 @@  KMSAN_SANITIZE_common.o := n
 # As above, instrumenting secondary CPU boot code causes boot hangs.
 KCSAN_SANITIZE_common.o := n
 
-obj-y			:= cacheinfo.o scattered.o topology.o
+obj-y			:= cacheinfo.o scattered.o
+obj-y			+= topology_common.o topology.o
 obj-y			+= common.o
 obj-y			+= rdrand.o
 obj-y			+= match.o
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1553,6 +1553,8 @@  static void __init early_identify_cpu(st
 		setup_force_cpu_cap(X86_FEATURE_CPUID);
 		cpu_parse_early_param();
 
+		cpu_init_topology(c);
+
 		if (this_cpu->c_early_init)
 			this_cpu->c_early_init(c);
 
@@ -1563,6 +1565,7 @@  static void __init early_identify_cpu(st
 			this_cpu->c_bsp_init(c);
 	} else {
 		setup_clear_cpu_cap(X86_FEATURE_CPUID);
+		cpu_init_topology(c);
 	}
 
 	setup_force_cpu_cap(X86_FEATURE_ALWAYS);
@@ -1708,18 +1711,6 @@  static void generic_identify(struct cpui
 
 	get_cpu_address_sizes(c);
 
-	if (c->cpuid_level >= 0x00000001) {
-		c->topo.initial_apicid = (cpuid_ebx(1) >> 24) & 0xFF;
-#ifdef CONFIG_X86_32
-# ifdef CONFIG_SMP
-		c->topo.apicid = apic->phys_pkg_id(c->topo.initial_apicid, 0);
-# else
-		c->topo.apicid = c->topo.initial_apicid;
-# endif
-#endif
-		c->topo.pkg_id = c->topo.initial_apicid;
-	}
-
 	get_model_name(c); /* Default name */
 
 	/*
@@ -1778,9 +1769,6 @@  static void identify_cpu(struct cpuinfo_
 	c->x86_model_id[0] = '\0';  /* Unset */
 	c->x86_max_cores = 1;
 	c->x86_coreid_bits = 0;
-	c->topo.cu_id = 0xff;
-	c->topo.llc_id = BAD_APICID;
-	c->topo.l2c_id = BAD_APICID;
 #ifdef CONFIG_X86_64
 	c->x86_clflush_size = 64;
 	c->x86_phys_bits = 36;
@@ -1799,6 +1787,8 @@  static void identify_cpu(struct cpuinfo_
 
 	generic_identify(c);
 
+	cpu_parse_topology(c);
+
 	if (this_cpu->c_identify)
 		this_cpu->c_identify(c);
 
@@ -1806,7 +1796,8 @@  static void identify_cpu(struct cpuinfo_
 	apply_forced_caps(c);
 
 #ifdef CONFIG_X86_64
-	c->topo.apicid = apic->phys_pkg_id(c->topo.initial_apicid, 0);
+	if (!topo_is_converted(c))
+		c->topo.apicid = apic->phys_pkg_id(c->topo.initial_apicid, 0);
 #endif
 
 	/*
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -2,6 +2,11 @@ 
 #ifndef ARCH_X86_CPU_H
 #define ARCH_X86_CPU_H
 
+#include <asm/cpu.h>
+#include <asm/topology.h>
+
+#include "topology.h"
+
 /* attempt to consolidate cpu attributes */
 struct cpu_dev {
 	const char	*c_vendor;
@@ -95,4 +100,5 @@  static inline bool spectre_v2_in_eibrs_m
 	       mode == SPECTRE_V2_EIBRS_RETPOLINE ||
 	       mode == SPECTRE_V2_EIBRS_LFENCE;
 }
+
 #endif /* ARCH_X86_CPU_H */
--- a/arch/x86/kernel/cpu/debugfs.c
+++ b/arch/x86/kernel/cpu/debugfs.c
@@ -5,6 +5,8 @@ 
 #include <asm/apic.h>
 #include <asm/processor.h>
 
+#include "cpu.h"
+
 static int cpu_debug_show(struct seq_file *m, void *p)
 {
 	unsigned long cpu = (unsigned long)m->private;
@@ -43,12 +45,47 @@  static const struct file_operations dfs_
 	.release	= single_release,
 };
 
+static int dom_debug_show(struct seq_file *m, void *p)
+{
+	static const char *domain_names[TOPO_ROOT_DOMAIN] = {
+		[TOPO_SMT_DOMAIN]	= "Thread",
+		[TOPO_CORE_DOMAIN]	= "Core",
+		[TOPO_MODULE_DOMAIN]	= "Module",
+		[TOPO_TILE_DOMAIN]	= "Tile",
+		[TOPO_DIE_DOMAIN]	= "Die",
+		[TOPO_PKG_DOMAIN]	= "Package",
+	};
+	unsigned int dom, nthreads = 1;
+
+	for (dom = 0; dom < TOPO_ROOT_DOMAIN; dom++) {
+		nthreads *= x86_topo_system.dom_size[dom];
+		seq_printf(m, "domain: %-10s shift: %u dom_size: %5u max_threads: %5u\n",
+			   domain_names[dom], x86_topo_system.dom_shifts[dom],
+			   x86_topo_system.dom_size[dom], nthreads);
+	}
+	return 0;
+}
+
+static int dom_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, dom_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_dom_ops = {
+	.open		= dom_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static __init int cpu_init_debugfs(void)
 {
 	struct dentry *dir, *base = debugfs_create_dir("topo", arch_debugfs_dir);
 	unsigned long id;
 	char name [10];
 
+	debugfs_create_file("domains", 0444, base, NULL, &dfs_dom_ops);
+
 	dir = debugfs_create_dir("cpus", base);
 	for_each_possible_cpu(id) {
 		sprintf(name, "%lu", id);
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology.h
@@ -0,0 +1,32 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_TOPOLOGY_H
+#define ARCH_X86_TOPOLOGY_H
+
+struct topo_scan {
+	struct cpuinfo_x86	*c;
+	unsigned int		dom_shifts[TOPO_MAX_DOMAIN];
+	unsigned int		dom_ncpus[TOPO_MAX_DOMAIN];
+
+};
+
+bool topo_is_converted(struct cpuinfo_x86 *c);
+void cpu_init_topology(struct cpuinfo_x86 *c);
+void cpu_parse_topology(struct cpuinfo_x86 *c);
+void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+		      unsigned int shift, unsigned int ncpus);
+
+static inline u32 topo_shift_apicid(u32 apicid, enum x86_topology_domains dom)
+{
+	if (dom == TOPO_SMT_DOMAIN)
+		return apicid;
+	return apicid >> x86_topo_system.dom_shifts[dom - 1];
+}
+
+static inline u32 topo_relative_domain_id(u32 apicid, enum x86_topology_domains dom)
+{
+	if (dom != TOPO_SMT_DOMAIN)
+		apicid >>= x86_topo_system.dom_shifts[dom - 1];
+	return apicid & (x86_topo_system.dom_size[dom] - 1);
+}
+
+#endif /* ARCH_X86_TOPOLOGY_H */
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -0,0 +1,187 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <xen/xen.h>
+
+#include <asm/apic.h>
+#include <asm/processor.h>
+#include <asm/smp.h>
+
+#include "cpu.h"
+
+struct x86_topology_system x86_topo_system __ro_after_init;
+
+void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+		      unsigned int shift, unsigned int ncpus)
+{
+	tscan->dom_shifts[dom] = shift;
+	tscan->dom_ncpus[dom] = ncpus;
+
+	/* Propagate to the upper levels */
+	for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
+		tscan->dom_shifts[dom] = tscan->dom_shifts[dom - 1];
+		tscan->dom_ncpus[dom] = tscan->dom_ncpus[dom - 1];
+	}
+}
+
+bool topo_is_converted(struct cpuinfo_x86 *c)
+{
+	/* Temporary until everything is converted over. */
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+	case X86_VENDOR_CENTAUR:
+	case X86_VENDOR_INTEL:
+	case X86_VENDOR_HYGON:
+	case X86_VENDOR_ZHAOXIN:
+		return false;
+	default:
+		/* Let all UP systems use the below */
+		return true;
+	}
+}
+
+static bool fake_topology(struct topo_scan *tscan)
+{
+	/*
+	 * Preset the CORE level shift for CPUID less systems and XEN_PV,
+	 * which has useless CPUID information.
+	 */
+	topology_set_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+	topology_set_dom(tscan, TOPO_CORE_DOMAIN, 1, 1);
+
+	return tscan->c->cpuid_level < 1 || xen_pv_domain();
+}
+
+static void parse_topology(struct topo_scan *tscan, bool early)
+{
+	const struct cpuinfo_topology topo_defaults = {
+		.cu_id			= 0xff,
+		.llc_id			= BAD_APICID,
+		.l2c_id			= BAD_APICID,
+	};
+	struct cpuinfo_x86 *c = tscan->c;
+	struct {
+		u32	unused0		: 16,
+			nproc		:  8,
+			apicid		:  8;
+	} ebx;
+
+	c->topo = topo_defaults;
+
+	if (fake_topology(tscan))
+		return;
+
+	/* Preset Initial APIC ID from CPUID leaf 1 */
+	cpuid_leaf_reg(1, CPUID_EBX, &ebx);
+	c->topo.initial_apicid = ebx.apicid;
+
+	/*
+	 * The initial invocation from early_identify_cpu() happens before
+	 * the APIC is mapped or X2APIC enabled. For establishing the
+	 * topology, that's not required. Use the initial APIC ID.
+	 */
+	if (early)
+		c->topo.apicid = c->topo.initial_apicid;
+	else
+		c->topo.apicid = read_apic_id();
+
+	/* The above is sufficient for UP */
+	if (!IS_ENABLED(CONFIG_SMP))
+		return;
+}
+
+static void topo_set_ids(struct topo_scan *tscan)
+{
+	struct cpuinfo_x86 *c = tscan->c;
+	u32 apicid = c->topo.apicid;
+
+	c->topo.pkg_id = topo_shift_apicid(apicid, TOPO_ROOT_DOMAIN);
+	c->topo.die_id = topo_shift_apicid(apicid, TOPO_DIE_DOMAIN);
+
+	/* Relative core ID */
+	c->topo.core_id = topo_relative_domain_id(apicid, TOPO_CORE_DOMAIN);
+}
+
+static void topo_set_max_cores(struct topo_scan *tscan)
+{
+	/*
+	 * Bug compatible for now. This is broken on hybrid systems:
+	 * 8 cores SMT + 8 cores w/o SMT
+	 * tscan.dom_ncpus[TOPO_CORE_DOMAIN] = 24; 24 / 2 = 12 !!
+	 *
+	 * Cannot be fixed without further topology enumeration changes.
+	 */
+	tscan->c->x86_max_cores = tscan->dom_ncpus[TOPO_CORE_DOMAIN] >>
+		x86_topo_system.dom_shifts[TOPO_SMT_DOMAIN];
+}
+
+void cpu_parse_topology(struct cpuinfo_x86 *c)
+{
+	unsigned int dom, cpu = smp_processor_id();
+	struct topo_scan tscan = { .c = c, };
+
+	parse_topology(&tscan, false);
+
+	if (!topo_is_converted(c))
+		return;
+
+	for (dom = TOPO_SMT_DOMAIN; dom < TOPO_MAX_DOMAIN; dom++) {
+		if (tscan.dom_shifts[dom] == x86_topo_system.dom_shifts[dom])
+			continue;
+		pr_err(FW_BUG "CPU%d: Topology domain %u shift %u != %u\n", cpu, dom,
+		       tscan.dom_shifts[dom], x86_topo_system.dom_shifts[dom]);
+	}
+
+	/* Bug compatible with the existing parsers */
+	if (tscan.dom_ncpus[TOPO_SMT_DOMAIN] > smp_num_siblings) {
+		if (system_state == SYSTEM_BOOTING) {
+			pr_warn_once("CPU%d: SMT detected and enabled late\n", cpu);
+			smp_num_siblings = tscan.dom_ncpus[TOPO_SMT_DOMAIN];
+		} else {
+			pr_warn_once("CPU%d: SMT detected after init. Too late!\n", cpu);
+		}
+	}
+
+	topo_set_ids(&tscan);
+	topo_set_max_cores(&tscan);
+}
+
+void __init cpu_init_topology(struct cpuinfo_x86 *c)
+{
+	struct topo_scan tscan = { .c = c, };
+	unsigned int dom, sft;
+
+	parse_topology(&tscan, true);
+
+	if (!topo_is_converted(c))
+		return;
+
+	/* Copy the shift values and calculate the unit sizes. */
+	memcpy(x86_topo_system.dom_shifts, tscan.dom_shifts, sizeof(x86_topo_system.dom_shifts));
+
+	dom = TOPO_SMT_DOMAIN;
+	x86_topo_system.dom_size[dom] = 1U << x86_topo_system.dom_shifts[dom];
+
+	for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
+		sft = x86_topo_system.dom_shifts[dom] - x86_topo_system.dom_shifts[dom - 1];
+		x86_topo_system.dom_size[dom] = 1U << sft;
+	}
+
+	topo_set_ids(&tscan);
+	topo_set_max_cores(&tscan);
+
+	/*
+	 * Bug compatible with the existing code. If the boot CPU does not
+	 * have SMT this ends up with one sibling. This needs way deeper
+	 * changes further down the road to get it right during early boot.
+	 */
+	smp_num_siblings = tscan.dom_ncpus[TOPO_SMT_DOMAIN];
+
+	/*
+	 * Neither it's clear whether there are as many dies as the APIC
+	 * space indicating die level is. But assume that the actual number
+	 * of CPUs gives a proper indication for now to stay bug compatible.
+	 */
+	__max_die_per_package = tscan.dom_ncpus[TOPO_DIE_DOMAIN] /
+		tscan.dom_ncpus[TOPO_DIE_DOMAIN - 1];
+}