mbox series

[0/3] memory,acpi: resize memory blocks based on CFMW alignment

Message ID 20241008044355.4325-1-gourry@gourry.net
Headers show
Series memory,acpi: resize memory blocks based on CFMW alignment | expand

Message

Gregory Price Oct. 8, 2024, 4:43 a.m. UTC
When physical address capacity is not aligned to the size of a memory
block managed size, the misaligned portion is not mapped - creating
an effective loss of capacity.

This appears to be a calculated decision based on the fact that most
regions would generally be aligned, and the loss of capacity would be
relatively limited. With CXL devices, this is no longer the case.

CXL exposes its memory for management through the ACPI CEDT (CXL Early
Detection Table) in a field called the CXL Fixed Memory Window.  Per
the CXL specification, this memory must be aligned to at least 256MB.

On X86, memory block capacity increases based on the overall capacity
of the machine - eventually reaching a maximum of 2GB per memory block.
When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
capacity, and in some cases more.

It is also possible for multiple CFMW to be exposed for a single device.
This can happen if a reserved region intersects with the target memory
location of the memory device. This happens on AMD x86 platforms. 

This patch set detects the alignments of all CFMW in the ACPI CEDT,
and changes the memory block size downward to meet the largest common
denomenator of the supported memory regions.

To do this, we needed 3 changes:
    1) extern memory block management functions for the acpi driver
    2) modify x86 to update its cached block size value
    3) add code in acpi/numa/srat.c to do the alignment check

Presently this only affects x86, since this is the only architecture
that implements set_memory_block_size_order.

Presently this appears to only affect x86, and we only mitigated there
since it is the only arch to implement set_memory_block_size_order.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (3):
  memory: extern memory_block_size_bytes and set_memory_block_size_order
  x86/mm: if memblock size is adjusted, update the cached value
  acpi,srat: reduce memory block size if CFMWS has a smaller alignment

 arch/x86/mm/init_64.c    | 17 ++++++++++++--
 drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
 drivers/base/memory.c    |  6 +++++
 include/linux/memory.h   |  4 ++--
 4 files changed, 71 insertions(+), 4 deletions(-)

Comments

David Hildenbrand Oct. 14, 2024, 11:54 a.m. UTC | #1
On 08.10.24 17:21, Gregory Price wrote:
> On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
>> On 08.10.24 16:51, Gregory Price wrote:
>>>>> +int __weak set_memory_block_size_order(unsigned int order)
>>>>> +{
>>>>> +	return -ENODEV;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
>>>>
>>>> I can understand what you are trying to achieve, but letting arbitrary
>>>> modules mess with this sounds like a bad idea.
>>>>
>>>
>>> I suppose the alternative is trying to scan the CEDT from inside each
>>> machine, rather than the ACPI driver?  Seems less maintainable.
>>>
>>> I don't entirely disagree with your comment.  I hummed and hawwed over
>>> externing this - hence the warning in the x86 machine.
>>>
>>> Open to better answers.
>>
>> Maybe an interface to add more restrictions on the maximum size might be
>> better (instead of setting the size/order, you would impose another upper
>> limit).
> 
> That is effectively what set_memory_block_size_order is, though.  Once
> blocks are exposed to the allocators, its no longer safe to change the
> size (in part because it was built assuming it wouldn't change, but I
> imagine there are other dragons waiting in the shadows to bite me).

Yes, we must run very early.

How is this supposed to interact with code like

set_block_size()

that also calls set_memory_block_size_order() on UV systems (assuming 
there will be CXL support sooner or later?)?


> 
> So this would basically amount to a lock-bit being set in the architecture,
> beyond which block size can no longer be changed and a big ol' splat
> can be generated that says "NO TOUCH".
> 
>> Just imagine having various users of such an interface ..
> 
> I don't wanna D:

Right, and it also doesn't make sense as explained in my other comment: 
this should never apply to loaded modules. :)
Gregory Price Oct. 14, 2024, 2:25 p.m. UTC | #2
On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
> On 08.10.24 17:21, Gregory Price wrote:
> > On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > > On 08.10.24 16:51, Gregory Price wrote:
> > > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > > +{
> > > > > > +	return -ENODEV;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > > 
> > > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > > modules mess with this sounds like a bad idea.
> > > > > 
> > > > 
> > > > I suppose the alternative is trying to scan the CEDT from inside each
> > > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > > 
> > > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > > externing this - hence the warning in the x86 machine.
> > > > 
> > > > Open to better answers.
> > > 
> > > Maybe an interface to add more restrictions on the maximum size might be
> > > better (instead of setting the size/order, you would impose another upper
> > > limit).
> > 
> > That is effectively what set_memory_block_size_order is, though.  Once
> > blocks are exposed to the allocators, its no longer safe to change the
> > size (in part because it was built assuming it wouldn't change, but I
> > imagine there are other dragons waiting in the shadows to bite me).
> 
> Yes, we must run very early.
> 
> How is this supposed to interact with code like
> 
> set_block_size()
> 
> that also calls set_memory_block_size_order() on UV systems (assuming there
> will be CXL support sooner or later?)?
> 
> 

Tying the other email to this one - just clarifying the way forward here.

It sounds like you're saying at a minimum drop EXPORT tags to prevent
modules from calling it - but it also sounds like built-ins need to be
prevented from touching it as well after a certain point in early boot.

Do you think I should go down the advise() path as suggested by Ira,
just adding a arch_lock_blocksize() bit and have set_..._order check it,
or should we just move towards each architecture having to go through
the ACPI:CEDT itself?

Doesn't sound like we've quite hit a consensus on where the actual
adjustment logic should land - just that this shouldn't be touched by modules.

~Gregory
David Hildenbrand Oct. 14, 2024, 8:32 p.m. UTC | #3
On 14.10.24 16:25, Gregory Price wrote:
> On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
>> On 08.10.24 17:21, Gregory Price wrote:
>>> On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
>>>> On 08.10.24 16:51, Gregory Price wrote:
>>>>>>> +int __weak set_memory_block_size_order(unsigned int order)
>>>>>>> +{
>>>>>>> +	return -ENODEV;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
>>>>>>
>>>>>> I can understand what you are trying to achieve, but letting arbitrary
>>>>>> modules mess with this sounds like a bad idea.
>>>>>>
>>>>>
>>>>> I suppose the alternative is trying to scan the CEDT from inside each
>>>>> machine, rather than the ACPI driver?  Seems less maintainable.
>>>>>
>>>>> I don't entirely disagree with your comment.  I hummed and hawwed over
>>>>> externing this - hence the warning in the x86 machine.
>>>>>
>>>>> Open to better answers.
>>>>
>>>> Maybe an interface to add more restrictions on the maximum size might be
>>>> better (instead of setting the size/order, you would impose another upper
>>>> limit).
>>>
>>> That is effectively what set_memory_block_size_order is, though.  Once
>>> blocks are exposed to the allocators, its no longer safe to change the
>>> size (in part because it was built assuming it wouldn't change, but I
>>> imagine there are other dragons waiting in the shadows to bite me).
>>
>> Yes, we must run very early.
>>
>> How is this supposed to interact with code like
>>
>> set_block_size()
>>
>> that also calls set_memory_block_size_order() on UV systems (assuming there
>> will be CXL support sooner or later?)?
>>
>>
> 
> Tying the other email to this one - just clarifying the way forward here.
> 
> It sounds like you're saying at a minimum drop EXPORT tags to prevent
> modules from calling it - but it also sounds like built-ins need to be
> prevented from touching it as well after a certain point in early boot.

Right, at least the EXPORT is not required.

> 
> Do you think I should go down the advise() path as suggested by Ira,
> just adding a arch_lock_blocksize() bit and have set_..._order check it,
> or should we just move towards each architecture having to go through
> the ACPI:CEDT itself?

Let's summarize what we currently have on x86 is:

1) probe_memory_block_size()

Triggered on first memory_block_size_bytes() invocation. Makes a 
decision based on:

a) Already set size using set_memory_block_size_order()
b) RAM size
c) Bare metal vs. virt (bare metal -> use max)
d) Virt: largest block size aligned to memory end


2) set_memory_block_size_order()

Triggered by set_block_size() on UV systems.


I don't think set_memory_block_size_order() is the right tool to use. We 
just want to leave that alone I think -- it's a direct translation of a 
kernel cmdline parameter that should win.

You essentially want to tweak the b)->d) logic to take other alignment 
into consideration.

Maybe have some simple callback mechanism probe_memory_block_size() that 
can consult other sources for alignment requirements?

If that's not an option, then another way to set further min-alignment 
requirements (whereby we take MIN(old_align, new_align))?
Gregory Price Oct. 14, 2024, 10:40 p.m. UTC | #4
On Mon, Oct 14, 2024 at 10:32:36PM +0200, David Hildenbrand wrote:
> On 14.10.24 16:25, Gregory Price wrote:
> > On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
> > > On 08.10.24 17:21, Gregory Price wrote:
> > > > On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > > > > On 08.10.24 16:51, Gregory Price wrote:
> > > > > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > > > > +{
> > > > > > > > +	return -ENODEV;
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > > > > 
> > > > > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > > > > modules mess with this sounds like a bad idea.
> > > > > > > 
> > > > > > 
> > > > > > I suppose the alternative is trying to scan the CEDT from inside each
> > > > > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > > > > 
> > > > > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > > > > externing this - hence the warning in the x86 machine.
> > > > > > 
> > > > > > Open to better answers.
> > > > > 
> > > > > Maybe an interface to add more restrictions on the maximum size might be
> > > > > better (instead of setting the size/order, you would impose another upper
> > > > > limit).
> > > > 
> > > > That is effectively what set_memory_block_size_order is, though.  Once
> > > > blocks are exposed to the allocators, its no longer safe to change the
> > > > size (in part because it was built assuming it wouldn't change, but I
> > > > imagine there are other dragons waiting in the shadows to bite me).
> > > 
> > > Yes, we must run very early.
> > > 
> > > How is this supposed to interact with code like
> > > 
> > > set_block_size()
> > > 
> > > that also calls set_memory_block_size_order() on UV systems (assuming there
> > > will be CXL support sooner or later?)?
> > > 
> > > 
> > 
> > Tying the other email to this one - just clarifying the way forward here.
> > 
> > It sounds like you're saying at a minimum drop EXPORT tags to prevent
> > modules from calling it - but it also sounds like built-ins need to be
> > prevented from touching it as well after a certain point in early boot.
> 
> Right, at least the EXPORT is not required.
> 
> > 
> > Do you think I should go down the advise() path as suggested by Ira,
> > just adding a arch_lock_blocksize() bit and have set_..._order check it,
> > or should we just move towards each architecture having to go through
> > the ACPI:CEDT itself?
> 
> Let's summarize what we currently have on x86 is:
> 
> 1) probe_memory_block_size()
> 
> Triggered on first memory_block_size_bytes() invocation. Makes a decision
> based on:
> 
> a) Already set size using set_memory_block_size_order()
> b) RAM size
> c) Bare metal vs. virt (bare metal -> use max)
> d) Virt: largest block size aligned to memory end
> 
> 
> 2) set_memory_block_size_order()
> 
> Triggered by set_block_size() on UV systems.
> 
> 
> I don't think set_memory_block_size_order() is the right tool to use. We
> just want to leave that alone I think -- it's a direct translation of a
> kernel cmdline parameter that should win.
> 
> You essentially want to tweak the b)->d) logic to take other alignment into
> consideration.
> 
> Maybe have some simple callback mechanism probe_memory_block_size() that can
> consult other sources for alignment requirements?
>

Thanks for this - I'll cobble something together.

Probably this ends up falling out similar to what Ira suggested. 

drivers/acpi/numa/srat.c
    acpi_numa_init():
        order = parse_cfwm(...)
        memblock_advise_size(order);

drivers/base/memory.c
    static int memblock_size_order = 0; /* let arch choose */

    int memblock_advise_size(order)
        int old_order;
        int new_order;
        if (order <= 0)
            return -EINVAL;

        do {
            old_order = memblock_size_order;
            new_order = MIN(old_order, order);
        } while (!atomic_cmpxchg(&memblock_size_order, old_order, new_order));

        /* memblock_size_order is now <= order, if -1 then the probe won */
        return new_order;

    int memblock_probe_size()
        return atomic_xchg(&memblock_size_order, -1);

drivers/base/memblock.h
    #ifdef HOTPLUG
        export memblock_advise_size()
        export memblock_probe_size()
    #else
        static memblock_advice_size() { return -ENODEV; } /* always fail */
        static memblock_probe_size() { return 0; } /* arch chooses */
    #endif

arch/*/mm/...
    probe_block_size():
        memblock_probe_size();
        /* select minimum across above suggested values */

> If that's not an option, then another way to set further min-alignment
> requirements (whereby we take MIN(old_align, new_align))?
> 
> -- 
> Cheers,
> 
> David / dhildenb
>