diff mbox series

x86/i8259: Work around buggy legacy PIC

Message ID 20210512210459.1983026-1-luzmaximilian@gmail.com
State New
Headers show
Series x86/i8259: Work around buggy legacy PIC | expand

Commit Message

Maximilian Luz May 12, 2021, 9:04 p.m. UTC
The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has
some problems on boot. For some reason it consistently does not respond
on the first try, requiring a couple more tries before it finally
responds.

This currently leads to the PIC not being properly recognized, which
prevents interrupt handling down the line. Ultimately, this also leads
to the pinctrl-amd driver failing to probe due to platform_get_irq()
returning -EINVAL for its base IRQ. That, in turn, means that several
interrupts are not available and device drivers relying on those will
defer probing indefinitely, as querying those interrupts returns
-EPROBE_DEFER.

Add a quirk table and a retry-loop to work around that.

Also switch to pr_info() due to complaints by checkpatch and add a
pr_fmt() definition for completeness.

Cc: <stable@vger.kernel.org> # 5.10+
Co-developed-by: Sachi King <nakato@nakato.io>
Signed-off-by: Sachi King <nakato@nakato.io>
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
---
 arch/x86/kernel/i8259.c | 51 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 46 insertions(+), 5 deletions(-)

Comments

David Laight May 13, 2021, 8:10 a.m. UTC | #1
From: Maximilian Luz

> Sent: 12 May 2021 22:05

> 

> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has

> some problems on boot. For some reason it consistently does not respond

> on the first try, requiring a couple more tries before it finally

> responds.


That seems very strange, something else must be going on that causes the grief.
The 8259 will be built into to the one of the cpu support chips.
I can't imagine that requires anything special.

It's not as though you have a real 8259 - which even a 286 can
break the inter-cycle recovery on (with the circuit from the
application note).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Maximilian Luz May 13, 2021, 10:11 a.m. UTC | #2
On 5/13/21 10:10 AM, David Laight wrote:
> From: Maximilian Luz

>> Sent: 12 May 2021 22:05

>>

>> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has

>> some problems on boot. For some reason it consistently does not respond

>> on the first try, requiring a couple more tries before it finally

>> responds.

> 

> That seems very strange, something else must be going on that causes the grief.

> The 8259 will be built into to the one of the cpu support chips.

> I can't imagine that requires anything special.


Right, it's definitely strange. Both Sachi (I imagine) and I don't know
much about these devices, so we're open for suggestions.

For reference: [1] and following comments are the discussion where we
(mostly Sachi) discovered the issue, tracking this from
platform_get_irq() returning -EINVAL to the PIC not probing. It also has
a dmesg log [2] attached, but as far as we can tell

     [    0.105820] Using NULL legacy PIC

is the only relevant line to that in there.

And lastly, if that's any help at all: The PIC device is described in
ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior
generation) and has the PIC described in the exact same way (can also be
found in that repository), but doesn't exhibit that behavior (and
doesn't show the "Using NULL legacy PIC" line). I expect there's not
much you can change to that definition so that's probably irrelevant
here.

Again, I don't really know anything about these devices, so my guess
would be bad hardware revision or bad firmware revision. All I know is
that retrying seems to "fix" it.

> It's not as though you have a real 8259 - which even a 286 can

> break the inter-cycle recovery on (with the circuit from the

> application note).


Right, I imagine that's some emulation for legacy reasons?

Regards,
Max

[1]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-831921098
[2]: https://github.com/linux-surface/linux-surface/files/6421166/dmesg-2021-05-04_22-11.txt
[3]: https://github.com/linux-surface/acpidumps/blob/69d5ecc1954ea5e90829b8e33541308e7451e951/surface_laptop_4_amd/dsdt.dsl#L1201-L1221
David Laight May 13, 2021, 10:36 a.m. UTC | #3
> -----Original Message-----

> From: Maximilian Luz <luzmaximilian@gmail.com>

> Sent: 13 May 2021 11:12

> To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar

> <mingo@redhat.com>; Borislav Petkov <bp@alien8.de>

> Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; x86@kernel.org; linux-

> kernel@vger.kernel.org; stable@vger.kernel.org

> Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC

> 

> On 5/13/21 10:10 AM, David Laight wrote:

> > From: Maximilian Luz

> >> Sent: 12 May 2021 22:05

> >>

> >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has

> >> some problems on boot. For some reason it consistently does not respond

> >> on the first try, requiring a couple more tries before it finally

> >> responds.

> >

> > That seems very strange, something else must be going on that causes the grief.

> > The 8259 will be built into to the one of the cpu support chips.

> > I can't imagine that requires anything special.

> 

> Right, it's definitely strange. Both Sachi (I imagine) and I don't know

> much about these devices, so we're open for suggestions.


I found a copy of the datasheet (I don't seem to have the black book):

https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf

The PC hardware has two 8259 in cascade mode.
(Cascaded using an interrupt that wasn't really using in the original
8088 PC which only had one 8259.)

I wonder if the bios has actually initialised is properly.
Some initialisation writes have to be done to set everything up.

It is also worth noting that the probe code is spectacularly crap.
It writes 0xff and then checks that 0xff is read back.
Almost anything (including a failed PCIe read to the ISA bridge)
will return 0xff and make the test pass.

It's about 35 years since I last wrote the code to initialise an 8259.
The memory cells are foggy.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Laight May 14, 2021, 10:51 a.m. UTC | #4
From: Sachi King

> Sent: 14 May 2021 20:41

> 

> On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote:

> > > -----Original Message-----

> > > From: Maximilian Luz <luzmaximilian@gmail.com>

> > > Sent: 13 May 2021 11:12

> > > To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner

> > > <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov

> > > <bp@alien8.de>

> > > Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>;

> > > x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org

> > > Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC

> > >

> > > On 5/13/21 10:10 AM, David Laight wrote:

> > >

> > > > From: Maximilian Luz

> > > >

> > > >> Sent: 12 May 2021 22:05

> > > >>

> > > >>

> > > >>

> > > >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4

> > > >> has

> > > >> some problems on boot. For some reason it consistently does not

> > > >> respond

> > > >> on the first try, requiring a couple more tries before it finally

> > > >> responds.

> > > >

> > > >

> > > >

> > > > That seems very strange, something else must be going on that causes the

> > > > grief.

> > > > The 8259 will be built into to the one of the cpu support

> > > > chips.

> > > > I can't imagine that requires anything special.

> > >

> > >

> > > Right, it's definitely strange. Both Sachi (I imagine) and I don't know

> > > much about these devices, so we're open for suggestions.

> >

> >

> > I found a copy of the datasheet (I don't seem to have the black book):

> >

> > https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf

> >

> > The PC hardware has two 8259 in cascade mode.

> > (Cascaded using an interrupt that wasn't really using in the original

> > 8088 PC which only had one 8259.)

> >

> > I wonder if the bios has actually initialised is properly.

> > Some initialisation writes have to be done to set everything up.

> 

> I suspect by the displayed behaviour you are correct and that it has

> not.  I'm struggling to figure out who to talk to to see that is

> something that can be fixed in the firmware.

> 

> > It is also worth noting that the probe code is spectacularly crap.

> > It writes 0xff and then checks that 0xff is read back.

> > Almost anything (including a failed PCIe read to the ISA bridge)

> > will return 0xff and make the test pass.

> 

> I was under the impression that it wrote 0xfb, and 0xff would be

> considered a failure.


I probably misread it.
For anything like a probe you really need to write one value
and read back something different - that is hopefully reasonably unique.
OTOH just the write can trash the wrong hardware - many old PC
hardware probes were very dubious.

Although unlikely here (since the logic is all inside chip)
on a real IO bus with tristate drivers a write followed by
a read to an unknown address could easily return the written
value due to capacitance of the data bus.

> > It's about 35 years since I last wrote the code to initialise an 8259.

> > The memory cells are foggy.

> 

> I'm not sure the i8259 is needed on the device, as the interrupts

> appear to function on the device if I bypass the nr_legacy_irqs() check

> while the legacy_pic is set to the null_legacy_pic.

> 

> The null_legacy_pic however specifies having 0 irqs, and the io_apic

> does not allow us to set the pin attributes unless the pin we are

> attempting to set is less than nr_legacy_irqs.

> 

> The IOAPIC seems to take responsibility for the 0-15 interrupts on this

> specific hardware, should we maybe be ignoring the i8259 and looking

> into allowing interrupts 0-15 to be setup even when the legacy_pic is

> not available?


That woke up some memory cells.
IIRC on an SMP kernel the IOAPIC is always used in preference to the 8259.
So maybe the initialisation code is just plain wrong!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Maximilian Luz May 14, 2021, 11:58 a.m. UTC | #5
On 5/14/21 9:41 PM, Sachi King wrote:
> On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote:

>>> -----Original Message-----

>>> From: Maximilian Luz <luzmaximilian@gmail.com>

>>> Sent: 13 May 2021 11:12

>>> To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner

>>> <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov

>>> <bp@alien8.de>

>>> Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>;

>>> x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org

>>> Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC

>>>

>>> On 5/13/21 10:10 AM, David Laight wrote:

>>>

>>>> From: Maximilian Luz

>>>>

>>>>> Sent: 12 May 2021 22:05

>>>>>

>>>>>

>>>>>

>>>>> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4

>>>>> has

>>>>> some problems on boot. For some reason it consistently does not

>>>>> respond

>>>>> on the first try, requiring a couple more tries before it finally

>>>>> responds.

>>>>

>>>>

>>>>

>>>> That seems very strange, something else must be going on that causes the

>>>> grief.

>>>> The 8259 will be built into to the one of the cpu support

>>>> chips.

>>>> I can't imagine that requires anything special.

>>>

>>>

>>> Right, it's definitely strange. Both Sachi (I imagine) and I don't know

>>> much about these devices, so we're open for suggestions.

>>

>>

>> I found a copy of the datasheet (I don't seem to have the black book):

>>

>> https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf

>>

>> The PC hardware has two 8259 in cascade mode.

>> (Cascaded using an interrupt that wasn't really using in the original

>> 8088 PC which only had one 8259.)

>>

>> I wonder if the bios has actually initialised is properly.

>> Some initialisation writes have to be done to set everything up.

> 

> I suspect by the displayed behaviour you are correct and that it has

> not.  I'm struggling to figure out who to talk to to see that is

> something that can be fixed in the firmware.


I'd assume that _some_ sort of interrupt setup is done by the BIOS/UEFI.
The UEFI on those devices is fairly well-featured, with touch support
via SPI and all. Furthermore, keyboard (also supported in the device's
UEFI) is handled via a custom UART protocol. Unless they rely on polling
for all of that, I believe they'd have to set up some interrupts.

Although, as you mention later on, that could also be handled via the
IOAPIC and the PIC is actually not supposed to be used. Maybe some
legacy component that never got tested and just broke with some new
hardware/firmware revision without anyone noticing? And since Linux
still seems to rely on that, we might be the first to notice.

Regards,
Max
Thomas Gleixner May 14, 2021, 1:01 p.m. UTC | #6
David,

On Thu, May 13 2021 at 10:36, David Laight wrote:

>> -----Original Message-----

>> From: Maximilian Luz <luzmaximilian@gmail.com>

>> Sent: 13 May 2021 11:12

>> To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar

>> <mingo@redhat.com>; Borislav Petkov <bp@alien8.de>

>> Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; x86@kernel.org; linux-

>> kernel@vger.kernel.org; stable@vger.kernel.org

>> Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC


can you please fix your mail client and spare us the useless header
duplication in the reply?

> It is also worth noting that the probe code is spectacularly crap.

> It writes 0xff and then checks that 0xff is read back.

> Almost anything (including a failed PCIe read to the ISA bridge)

> will return 0xff and make the test pass.


        unsigned char probe_val = ~(1 << PIC_CASCADE_IR);

	outb(probe_val, PIC_MASTER_IMR);
	new_val = inb(PIC_MASTER_IMR);

How is that writing 0xFF?

Thanks,

        tglx
David Laight May 14, 2021, 1:13 p.m. UTC | #7
From: Thomas Gleixner

> Sent: 14 May 2021 14:02

> 

> David,

> 

> On Thu, May 13 2021 at 10:36, David Laight wrote:

> 

> >> -----Original Message-----

> >> From: Maximilian Luz <luzmaximilian@gmail.com>

> 

> can you please fix your mail client and spare us the useless header

> duplication in the reply?


I have to delete them by hand - must have forgotten, I can't fix outlook :-)

> > It is also worth noting that the probe code is spectacularly crap.

> > It writes 0xff and then checks that 0xff is read back.

> > Almost anything (including a failed PCIe read to the ISA bridge)

> > will return 0xff and make the test pass.

> 

>         unsigned char probe_val = ~(1 << PIC_CASCADE_IR);

> 

> 	outb(probe_val, PIC_MASTER_IMR);

> 	new_val = inb(PIC_MASTER_IMR);

> 

> How is that writing 0xFF?


Sorry I misread the code and diagnostic output.

In any case writing a value and expecting the same value back
isn't exactly a high-quality probe.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Thomas Gleixner May 14, 2021, 1:44 p.m. UTC | #8
Max,

On Thu, May 13 2021 at 12:11, Maximilian Luz wrote:
> And lastly, if that's any help at all: The PIC device is described in

> ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior

> generation) and has the PIC described in the exact same way (can also be

> found in that repository), but doesn't exhibit that behavior (and

> doesn't show the "Using NULL legacy PIC" line). I expect there's not

> much you can change to that definition so that's probably irrelevant

> here.

>

> Again, I don't really know anything about these devices, so my guess

> would be bad hardware revision or bad firmware revision. All I know is

> that retrying seems to "fix" it.


That might be a a power optimization thing.

Except for older systems the PIC is not really required for IOAPiC to
work. But there is some historical code which makes assumptions. We can
change that, but that needs some careful thoughts.

Thanks,

        tglx
David Laight May 14, 2021, 4:12 p.m. UTC | #9
From: Thomas Gleixner

> Sent: 14 May 2021 14:45

> 

> Max,

> 

> On Thu, May 13 2021 at 12:11, Maximilian Luz wrote:

> > And lastly, if that's any help at all: The PIC device is described in

> > ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior

> > generation) and has the PIC described in the exact same way (can also be

> > found in that repository), but doesn't exhibit that behavior (and

> > doesn't show the "Using NULL legacy PIC" line). I expect there's not

> > much you can change to that definition so that's probably irrelevant

> > here.

> >

> > Again, I don't really know anything about these devices, so my guess

> > would be bad hardware revision or bad firmware revision. All I know is

> > that retrying seems to "fix" it.

> 

> That might be a a power optimization thing.

> 

> Except for older systems the PIC is not really required for IOAPiC to

> work. But there is some historical code which makes assumptions. We can

> change that, but that needs some careful thoughts.


A more interesting probe would be:
- Write some value to register 1 - the mask.
- Write 9 to register zero (selects interrupt in service register).
- Read register 0 - should be zero since we aren't in as ISR.
- Read register 1 - should get the mask back.
You can also write 8 to register 0, reads then return the pending interrupts.
Their might be pending interrupts - so that value can't be checked.

But if reads start returning the last written value you might only
have capacitors on the data bus.

The required initialisation registers are pretty fixed for the PC hardware.
But finding the values requires a bit of work.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Ingo Molnar May 14, 2021, 4:19 p.m. UTC | #10
* David Laight <David.Laight@ACULAB.COM> wrote:

> > > It is also worth noting that the probe code is spectacularly crap.

> > > It writes 0xff and then checks that 0xff is read back.

> > > Almost anything (including a failed PCIe read to the ISA bridge)

> > > will return 0xff and make the test pass.

> > 

> >         unsigned char probe_val = ~(1 << PIC_CASCADE_IR);

> > 

> > 	outb(probe_val, PIC_MASTER_IMR);

> > 	new_val = inb(PIC_MASTER_IMR);

> > 

> > How is that writing 0xFF?

> 

> Sorry I misread the code and diagnostic output.

> 

> In any case writing a value and expecting the same value back

> isn't exactly a high-quality probe.


It's not, and it's not intended to be: 0x21 is a well-known port nobody was 
crazy enough to override yet, so that probe basically filters out the 
"there is nothing at that port, at all" case, which would normally return 
0xff, or in a few weird cases 0x00 perhaps.

Writing something inbetween those values and getting the same value back 
tells us that something functional occupies that well-known IO-port, 
pretending to be a i8259 PIC.

Which is what we wanted to know, given the context.

Thanks,

	Ingo
H. Peter Anvin May 14, 2021, 5:28 p.m. UTC | #11
On 5/14/21 9:12 AM, David Laight wrote:
> 

> A more interesting probe would be:

> - Write some value to register 1 - the mask.

> - Write 9 to register zero (selects interrupt in service register).

> - Read register 0 - should be zero since we aren't in as ISR.

> - Read register 1 - should get the mask back.

> You can also write 8 to register 0, reads then return the pending interrupts.

> Their might be pending interrupts - so that value can't be checked.

> 

> But if reads start returning the last written value you might only

> have capacitors on the data bus.


What data bus? These things haven't been on a physical parallel bus for 
ages.

> The required initialisation registers are pretty fixed for the PC hardware.

> But finding the values requires a bit of work.

> 

> 	David


And you always risk activating new bugs.

Since this appears to be a specific platform advertising the wrong 
answer in firmware, this is better handled as a quirk.

	-hpa
Thomas Gleixner May 14, 2021, 5:32 p.m. UTC | #12
On Fri, May 14 2021 at 13:58, Maximilian Luz wrote:
> On 5/14/21 9:41 PM, Sachi King wrote:

> I'd assume that _some_ sort of interrupt setup is done by the BIOS/UEFI.

> The UEFI on those devices is fairly well-featured, with touch support

> via SPI and all. Furthermore, keyboard (also supported in the device's

> UEFI) is handled via a custom UART protocol. Unless they rely on polling

> for all of that, I believe they'd have to set up some interrupts.


Polling would be truly surprising.

> Although, as you mention later on, that could also be handled via the

> IOAPIC and the PIC is actually not supposed to be used. Maybe some

> legacy component that never got tested and just broke with some new

> hardware/firmware revision without anyone noticing? And since Linux

> still seems to rely on that, we might be the first to notice.


That's a valid assumption. As I said, we can make IOAPIC work even w/o
PIC. I'll have a look how much PIC assumptions are still around.

Thanks,

        tglx
H. Peter Anvin May 14, 2021, 5:35 p.m. UTC | #13
On 5/14/21 10:32 AM, Thomas Gleixner wrote:
> 

> That's a valid assumption. As I said, we can make IOAPIC work even w/o

> PIC. I'll have a look how much PIC assumptions are still around.

> 


As far as I read, the problem isn't actually the absence of a PIC (we 
definitely boot systems without PICs all the time now), but rather that 
the PIC is advertised in ACPI but is buggy or absent; a similar platform 
with different firmware doesn't have problem.

If my understanding of the thread is correct, it's quirk fodder.

	-hpa
Sachi King May 14, 2021, 7:41 p.m. UTC | #14
On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote:
> > -----Original Message-----

> > From: Maximilian Luz <luzmaximilian@gmail.com>

> > Sent: 13 May 2021 11:12

> > To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner

> > <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov

> > <bp@alien8.de>

> > Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>;

> > x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org

> > Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC

> > 

> > On 5/13/21 10:10 AM, David Laight wrote:

> > 

> > > From: Maximilian Luz

> > > 

> > >> Sent: 12 May 2021 22:05

> > >>

> > >>

> > >>

> > >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4

> > >> has

> > >> some problems on boot. For some reason it consistently does not

> > >> respond

> > >> on the first try, requiring a couple more tries before it finally

> > >> responds.

> > >

> > >

> > >

> > > That seems very strange, something else must be going on that causes the

> > > grief.

> > > The 8259 will be built into to the one of the cpu support

> > > chips.

> > > I can't imagine that requires anything special.

> > 

> > 

> > Right, it's definitely strange. Both Sachi (I imagine) and I don't know

> > much about these devices, so we're open for suggestions.

> 

> 

> I found a copy of the datasheet (I don't seem to have the black book):

> 

> https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf

> 

> The PC hardware has two 8259 in cascade mode.

> (Cascaded using an interrupt that wasn't really using in the original

> 8088 PC which only had one 8259.)

> 

> I wonder if the bios has actually initialised is properly.

> Some initialisation writes have to be done to set everything up.


I suspect by the displayed behaviour you are correct and that it has
not.  I'm struggling to figure out who to talk to to see that is
something that can be fixed in the firmware.

> It is also worth noting that the probe code is spectacularly crap.

> It writes 0xff and then checks that 0xff is read back.

> Almost anything (including a failed PCIe read to the ISA bridge)

> will return 0xff and make the test pass.


I was under the impression that it wrote 0xfb, and 0xff would be
considered a failure.

> It's about 35 years since I last wrote the code to initialise an 8259.

> The memory cells are foggy.


I'm not sure the i8259 is needed on the device, as the interrupts
appear to function on the device if I bypass the nr_legacy_irqs() check
while the legacy_pic is set to the null_legacy_pic.

The null_legacy_pic however specifies having 0 irqs, and the io_apic
does not allow us to set the pin attributes unless the pin we are
attempting to set is less than nr_legacy_irqs.

The IOAPIC seems to take responsibility for the 0-15 interrupts on this
specific hardware, should we maybe be ignoring the i8259 and looking
into allowing interrupts 0-15 to be setup even when the legacy_pic is
not available?

Cheers,
Sachi
Maximilian Luz May 14, 2021, 10:47 p.m. UTC | #15
On 5/14/21 7:35 PM, H. Peter Anvin wrote:
> On 5/14/21 10:32 AM, Thomas Gleixner wrote:

>>

>> That's a valid assumption. As I said, we can make IOAPIC work even w/o

>> PIC. I'll have a look how much PIC assumptions are still around.

>>

> 

> As far as I read, the problem isn't actually the absence of a PIC (we definitely boot systems without PICs all the time now), but rather that the PIC is advertised in ACPI but is buggy or absent; a similar platform with different firmware doesn't have problem.

> 

> If my understanding of the thread is correct, it's quirk fodder.


I believe the theory was that, while the PIC is advertised in ACPI, it
might be expected to not be used and only present for some legacy reason
(therefore untested and buggy). Which I believe led to the question
whether we shouldn't prefer IOAPIC on systems like that in general. So I
guess it comes down to how you define "systems like that". By Tomas'
comment, I guess it should be possible to implement this as "systems
that should prefer IOAPIC over legacy PIC" quirk.

I guess all modern machines should have an IOAPIC, so it might also be
preferable to expand that definition, maybe over time and with enough
testing.

If possible, I think that would be preferable to the "just retry until
it works" kind of workaround.

As Sachi mentioned in her reply:

> I'm not sure the i8259 is needed on the device, as the interrupts

> appear to function on the device if I bypass the nr_legacy_irqs() check

> while the legacy_pic is set to the null_legacy_pic.

> 

> The null_legacy_pic however specifies having 0 irqs, and the io_apic

> does not allow us to set the pin attributes unless the pin we are

> attempting to set is less than nr_legacy_irqs.

> 

> The IOAPIC seems to take responsibility for the 0-15 interrupts on this

> specific hardware, should we maybe be ignoring the i8259 and looking

> into allowing interrupts 0-15 to be setup even when the legacy_pic is

> not available?


Just my two cents, please keep in mind that I'm out of my depth here.

Regards,
Max
Thomas Gleixner May 17, 2021, 6:40 p.m. UTC | #16
Max,

On Sat, May 15 2021 at 00:47, Maximilian Luz wrote:
> I believe the theory was that, while the PIC is advertised in ACPI, it

> might be expected to not be used and only present for some legacy reason

> (therefore untested and buggy). Which I believe led to the question

> whether we shouldn't prefer IOAPIC on systems like that in general. So I

> guess it comes down to how you define "systems like that". By Tomas'

> comment, I guess it should be possible to implement this as "systems

> that should prefer IOAPIC over legacy PIC" quirk.

>

> I guess all modern machines should have an IOAPIC, so it might also be

> preferable to expand that definition, maybe over time and with enough

> testing.


I just double checked and we actually can boot just fine without the
PIC even when it is advertised, but disfunctional.

Can you please add "apic=verbose" to the kernel command line and provide
full dmesg output for a kernel w/o your patch and one with your patch
applied?

Thanks,

        tglx
Maximilian Luz May 17, 2021, 7:25 p.m. UTC | #17
On 5/17/21 8:40 PM, Thomas Gleixner wrote:
> Max,

> 

> On Sat, May 15 2021 at 00:47, Maximilian Luz wrote:

>> I believe the theory was that, while the PIC is advertised in ACPI, it

>> might be expected to not be used and only present for some legacy reason

>> (therefore untested and buggy). Which I believe led to the question

>> whether we shouldn't prefer IOAPIC on systems like that in general. So I

>> guess it comes down to how you define "systems like that". By Tomas'

>> comment, I guess it should be possible to implement this as "systems

>> that should prefer IOAPIC over legacy PIC" quirk.

>>

>> I guess all modern machines should have an IOAPIC, so it might also be

>> preferable to expand that definition, maybe over time and with enough

>> testing.

> 

> I just double checked and we actually can boot just fine without the

> PIC even when it is advertised, but disfunctional.

> 

> Can you please add "apic=verbose" to the kernel command line and provide

> full dmesg output for a kernel w/o your patch and one with your patch

> applied?


I don't actually own an affected device, but I'm sure Sachi can provide
you with that.

As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs()
returns 0. That in turn causes mp_check_pin_attr() to return false
because is_level and active_low don't seem to match the expected values.
That check is essentially ignored if nr_legacy_irqs() returns a high
enough value. I guess that might also be a firmware bug here? Not sure
where the expected values come from.

Due to this, mp_map_pin_to_irq() fails with -EBUSY which causes
acpi_register_gsi() to fail. That fails in acpi_dev_get_irqresource(),
which causes the IRQ resource to be marked as disabled.

Down the line, this then causes platform_get_irq() to return -EINVAL,
because the IRQ we're trying to get has the IORESOURCE_DISABLED bit set.

Sachi can probably walk you through this a bit better as she's the one
who tracked this down. See also [1, 2] and following comments.

Regards,
Max

[1]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-835309201
[2]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-835261784
Thomas Gleixner May 17, 2021, 11:27 p.m. UTC | #18
On Mon, May 17 2021 at 21:25, Maximilian Luz wrote:
> On 5/17/21 8:40 PM, Thomas Gleixner wrote:

>> Can you please add "apic=verbose" to the kernel command line and provide

>> full dmesg output for a kernel w/o your patch and one with your patch

>> applied?

>

> I don't actually own an affected device, but I'm sure Sachi can provide

> you with that.


Ok.

> As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs()

> returns 0. That in turn causes mp_check_pin_attr() to return false

> because is_level and active_low don't seem to match the expected

> values.


Ok.

> That check is essentially ignored if nr_legacy_irqs() returns a high

> enough value.


Close enough.

> I guess that might also be a firmware bug here? Not sure where the

> expected values come from.


They come from the interrupt override ACPI table and if not supplied
then irq 0-15 is preset with default values, which are type=edge and
polarity=high, i.e.  the opposite of what the failing driver wants.

The ACPI table lacks an override entry for IRQ7. I looked at one of the
dmesg files in that github thread and that has overrides:

[    0.111674] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.111681] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.111688] ACPI: IRQ0 used by override.
[    0.111692] ACPI: IRQ9 used by override.

IRQ7 should have a corresponding entry as IRQ9 has:

https://github.com/linux-surface/acpidumps/blob/4da0148744164cea0c924dab92f45842fde03177/surface_laptop_4_amd/apic.dsl#L178

          Subtable Type : 02 [Interrupt Source Override]
                 Length : 0A
                    Bus : 00
                 Source : 07
              Interrupt : 00000007
  Flags (decoded below) : 000F
               Polarity : 3
           Trigger Mode : 3

> Sachi can probably walk you through this a bit better as she's the one

> who tracked this down. See also [1, 2] and following comments.


Impressive detective work!

Sachi, can you please try the hack below to confirm the above?

It's not meant to be a solution, but it's the most trivial way to
validate this.

I'm pretty sure that Windows on Surface does not care about the PIC at
all. Whether that's on purpose to safe power or just because Windows
ignores the PIC completely by now does not matter at all. No idea how
that repeated poking on the PIC makes it come alive either and TBH, I
don't care too much about it simply because Linux is able to cope with a
missing PIC as long as the ACPI tables are correct.

I'm way too tired to think about a proper solution for that problem and
I noticed another related issue in that dmesg output:

[    0.272448] Failed to register legacy timer interrupt

It's not a problem which causes failures, but it's related to the
missing PIC.

Needs some more thoughts with brain awake...

Thanks,

        tglx
---
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -21,6 +21,7 @@
 #include <linux/efi-bgrt.h>
 #include <linux/serial_core.h>
 #include <linux/pgtable.h>
+#include <linux/dmi.h>
 
 #include <asm/e820/api.h>
 #include <asm/irqdomain.h>
@@ -1155,6 +1156,17 @@ static void __init mp_config_acpi_legacy
 	}
 }
 
+static const struct dmi_system_id surface_quirk[] __initconst = {
+	{
+		.ident = "Microsoft Surface Laptop 4 (AMD)",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_MATCH(DMI_PRODUCT_SKU, "Surface_Laptop_4_1952:1953")
+		},
+	},
+	{}
+};
+
 /*
  * Parse IOAPIC related entries in MADT
  * returns 0 on success, < 0 on error
@@ -1212,6 +1224,11 @@ static int __init acpi_parse_madt_ioapic
 		acpi_sci_ioapic_setup(acpi_gbl_FADT.sci_interrupt, 0, 0,
 				      acpi_gbl_FADT.sci_interrupt);
 
+	if (dmi_check_system(surface_quirk)) {
+		pr_warn("Surface hack: Override irq 7\n");
+		mp_override_legacy_irq(7, 3, 3, 7);
+	}
+
 	/* Fill in identity legacy mappings where no override */
 	mp_config_acpi_legacy_irqs();
Andy Shevchenko May 18, 2021, 8:28 a.m. UTC | #19
On Tue, May 18, 2021 at 01:27:02AM +0200, Thomas Gleixner wrote:
> On Mon, May 17 2021 at 21:25, Maximilian Luz wrote:

> > On 5/17/21 8:40 PM, Thomas Gleixner wrote:

> >> Can you please add "apic=verbose" to the kernel command line and provide

> >> full dmesg output for a kernel w/o your patch and one with your patch

> >> applied?

> >

> > I don't actually own an affected device, but I'm sure Sachi can provide

> > you with that.

> 

> Ok.

> 

> > As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs()

> > returns 0. That in turn causes mp_check_pin_attr() to return false

> > because is_level and active_low don't seem to match the expected

> > values.

> 

> Ok.

> 

> > That check is essentially ignored if nr_legacy_irqs() returns a high

> > enough value.

> 

> Close enough.

> 

> > I guess that might also be a firmware bug here? Not sure where the

> > expected values come from.

> 

> They come from the interrupt override ACPI table and if not supplied

> then irq 0-15 is preset with default values, which are type=edge and

> polarity=high, i.e.  the opposite of what the failing driver wants.

> 

> The ACPI table lacks an override entry for IRQ7. I looked at one of the

> dmesg files in that github thread and that has overrides:

> 

> [    0.111674] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)

> [    0.111681] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)

> [    0.111688] ACPI: IRQ0 used by override.

> [    0.111692] ACPI: IRQ9 used by override.

> 

> IRQ7 should have a corresponding entry as IRQ9 has:

> 

> https://github.com/linux-surface/acpidumps/blob/4da0148744164cea0c924dab92f45842fde03177/surface_laptop_4_amd/apic.dsl#L178

> 

>           Subtable Type : 02 [Interrupt Source Override]

>                  Length : 0A

>                     Bus : 00

>                  Source : 07

>               Interrupt : 00000007

>   Flags (decoded below) : 000F

>                Polarity : 3

>            Trigger Mode : 3

> 

> > Sachi can probably walk you through this a bit better as she's the one

> > who tracked this down. See also [1, 2] and following comments.

> 

> Impressive detective work!

> 

> Sachi, can you please try the hack below to confirm the above?

> 

> It's not meant to be a solution, but it's the most trivial way to

> validate this.

> 

> I'm pretty sure that Windows on Surface does not care about the PIC at

> all. Whether that's on purpose to safe power or just because Windows

> ignores the PIC completely by now does not matter at all. No idea how

> that repeated poking on the PIC makes it come alive either and TBH, I

> don't care too much about it simply because Linux is able to cope with a

> missing PIC as long as the ACPI tables are correct.

> 

> I'm way too tired to think about a proper solution for that problem and

> I noticed another related issue in that dmesg output:

> 

> [    0.272448] Failed to register legacy timer interrupt

> 

> It's not a problem which causes failures, but it's related to the

> missing PIC.


But ACPI has a pretty nice means about missing legacy hardware, it's called
Hardware Reduced mode. It excludes automatically the (legacy) PIC, PIT, etc.

OTOH it excludes ACPI power chip as well. I haven't looked into this, just
share my thoughts what else can be checked. (On Intel the MID devices use
that approach)

> Needs some more thoughts with brain awake...


-- 
With Best Regards,
Andy Shevchenko
Thomas Gleixner May 18, 2021, 3:45 p.m. UTC | #20
Sachi,

On Wed, May 19 2021 at 05:58, Sachi King wrote:
> On Tuesday, May 18, 2021 9:27:02 AM AEST Thomas Gleixner wrote:

>> On Mon, May 17 2021 at 21:25, Maximilian Luz wrote:

>> > On 5/17/21 8:40 PM, Thomas Gleixner wrote:

>> >> Can you please add "apic=verbose" to the kernel command line and provide

>> >> full dmesg output for a kernel w/o your patch and one with your patch

>> >> applied?

>

> I've linked to a dmesg with "apic=verbose" below with the irq 7 override hack 

> applied.  Would you still like a copy without either of these patches

> applied?


No, that's pretty conclusive.

> What's the convention for including a dmesg on the mailing list?  I've 

> included the copy via gist as pasting it into email seems incorrect, but 

> that's also probably not the correct convention.


That's fine.

>> Sachi, can you please try the hack below to confirm the above?

>>

>> It's not meant to be a solution, but it's the most trivial way to

>> validate this.

>

> I've applied that patch and the GPIO driver loads and functions.


It's exactly what I expected. So now for a proper solution. After
talking to Raphael it seems something like that quirk is the best option
we have. I'll try to come up with something less horrible. There is some
other minor issue I noticed, which I need to look at as well.

Thanks for digging into this and for testing!

       tglx
Sachi King May 18, 2021, 7:58 p.m. UTC | #21
On Tuesday, May 18, 2021 9:27:02 AM AEST Thomas Gleixner wrote:
> On Mon, May 17 2021 at 21:25, Maximilian Luz wrote:

> > On 5/17/21 8:40 PM, Thomas Gleixner wrote:

> >> Can you please add "apic=verbose" to the kernel command line and provide

> >> full dmesg output for a kernel w/o your patch and one with your patch

> >> applied?


I've linked to a dmesg with "apic=verbose" below with the irq 7 override hack 
applied.  Would you still like a copy without either of these patches applied?

What's the convention for including a dmesg on the mailing list?  I've 
included the copy via gist as pasting it into email seems incorrect, but 
that's also probably not the correct convention.

> Sachi, can you please try the hack below to confirm the above?

>

> It's not meant to be a solution, but it's the most trivial way to

> validate this.


I've applied that patch and the GPIO driver loads and functions.

The dmesg with the patch and "apic=verbose" I've placed at:
https://gist.github.com/nakato/08c6962a540d91b6f9597303d54bffe5

Thanks,
Sachi
diff mbox series

Patch

diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
index 282b4ee1339f..0da757c6b292 100644
--- a/arch/x86/kernel/i8259.c
+++ b/arch/x86/kernel/i8259.c
@@ -1,4 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "i8259: " fmt
+
 #include <linux/linkage.h>
 #include <linux/errno.h>
 #include <linux/signal.h>
@@ -16,6 +19,7 @@ 
 #include <linux/io.h>
 #include <linux/delay.h>
 #include <linux/pgtable.h>
+#include <linux/dmi.h>
 
 #include <linux/atomic.h>
 #include <asm/timer.h>
@@ -298,11 +302,39 @@  static void unmask_8259A(void)
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
 }
 
+/*
+ * DMI table to identify devices with quirky probe behavior. See comment in
+ * probe_8259A() for more details.
+ */
+static const struct dmi_system_id retry_probe_quirk_table[] = {
+	{
+		.ident = "Microsoft Surface Laptop 4 (AMD)",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_MATCH(DMI_PRODUCT_SKU, "Surface_Laptop_4_1952:1953")
+		},
+	},
+	{}
+};
+
 static int probe_8259A(void)
 {
 	unsigned long flags;
 	unsigned char probe_val = ~(1 << PIC_CASCADE_IR);
 	unsigned char new_val;
+	unsigned int i, imax = 1;
+
+	/*
+	 * Some systems have a legacy PIC that doesn't immediately respond
+	 * after boot. We know it's there, we know it should respond and is
+	 * required for proper interrupt handling later on, so let's try a
+	 * couple of times.
+	 */
+	if (dmi_check_system(retry_probe_quirk_table)) {
+		pr_warn("system with broken legacy PIC detected, re-trying multiple times if necessary\n");
+		imax = 10;
+	}
+
 	/*
 	 * Check to see if we have a PIC.
 	 * Mask all except the cascade and read
@@ -312,15 +344,24 @@  static int probe_8259A(void)
 	 */
 	raw_spin_lock_irqsave(&i8259A_lock, flags);
 
-	outb(0xff, PIC_SLAVE_IMR);	/* mask all of 8259A-2 */
-	outb(probe_val, PIC_MASTER_IMR);
-	new_val = inb(PIC_MASTER_IMR);
-	if (new_val != probe_val) {
-		printk(KERN_INFO "Using NULL legacy PIC\n");
+	for (i = 0; i < imax; i++) {
+		outb(0xff, PIC_SLAVE_IMR);	/* mask all of 8259A-2 */
+		outb(probe_val, PIC_MASTER_IMR);
+		new_val = inb(PIC_MASTER_IMR);
+		if (new_val == probe_val)
+			break;
+	}
+
+	if (i == imax) {
+		pr_info("using NULL legacy PIC\n");
 		legacy_pic = &null_legacy_pic;
 	}
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
+
+	if (imax > 1 && i < imax)
+		pr_info("got legacy PIC after %d tries\n", i + 1);
+
 	return nr_legacy_irqs();
 }