
[0/7] xen/events: bug fixes and some diagnostic aids

Message ID 20210206104932.29064-1-jgross@suse.com

Message

Jürgen Groß Feb. 6, 2021, 10:49 a.m. UTC
The first three patches are fixes for XSA-332. They avoid WARN splats
and a performance issue with interdomain events.

Patches 4 and 5 are some additions to event handling in order to add
some per pv-device statistics to sysfs and the ability to have a per
backend device spurious event delay control.

Patches 6 and 7 are minor fixes I had lying around.

Juergen Gross (7):
  xen/events: reset affinity of 2-level event initially
  xen/events: don't unmask an event channel when an eoi is pending
  xen/events: fix lateeoi irq acknowledgement
  xen/events: link interdomain events to associated xenbus device
  xen/events: add per-xenbus device event statistics and settings
  xen/evtch: use smp barriers for user event ring
  xen/evtchn: read producer index only once

 drivers/block/xen-blkback/xenbus.c  |   2 +-
 drivers/net/xen-netback/interface.c |  16 ++--
 drivers/xen/events/events_2l.c      |  20 +++++
 drivers/xen/events/events_base.c    | 133 ++++++++++++++++++++++------
 drivers/xen/evtchn.c                |   6 +-
 drivers/xen/pvcalls-back.c          |   4 +-
 drivers/xen/xen-pciback/xenbus.c    |   2 +-
 drivers/xen/xen-scsiback.c          |   2 +-
 drivers/xen/xenbus/xenbus_probe.c   |  66 ++++++++++++++
 include/xen/events.h                |   7 +-
 include/xen/xenbus.h                |   7 ++
 11 files changed, 217 insertions(+), 48 deletions(-)

Comments

Julien Grall Feb. 8, 2021, 9:54 a.m. UTC | #1
On 08/02/2021 09:41, Jürgen Groß wrote:
> On 08.02.21 10:11, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 07/02/2021 12:58, Jürgen Groß wrote:
>>> On 06.02.21 19:46, Julien Grall wrote:
>>>> Hi Juergen,
>>>>
>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>> The first three patches are fixes for XSA-332. They avoid WARN splats
>>>>> and a performance issue with interdomain events.
>>>>
>>>> Thanks for helping to figure out the problem. Unfortunately, I still
>>>> reliably see the WARN splat with the latest Linux master
>>>> (1e0d27fce010) + your first 3 patches.
>>>>
>>>> I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L 
>>>> events ABI.
>>>>
>>>> After some debugging, I think I have an idea of what went wrong. The
>>>> problem happens when the event is initially bound from vCPU0 to a 
>>>> different vCPU.
>>>>
>>>>  From the comment in xen_rebind_evtchn_to_cpu(), we are masking the 
>>>> event to prevent it being delivered on an unexpected vCPU. However, 
>>>> I believe the following can happen:
>>>>
>>>> vCPU0                      | vCPU1
>>>>                            |
>>>>                            | Call xen_rebind_evtchn_to_cpu()
>>>> receive event X            |
>>>>                            | mask event X
>>>>                            | bind to vCPU1
>>>> <vCPU descheduled>         | unmask event X
>>>>                            |
>>>>                            | receive event X
>>>>                            |
>>>>                            | handle_edge_irq(X)
>>>> handle_edge_irq(X)         |  -> handle_irq_event()
>>>>                            |   -> set IRQD_IN_PROGRESS
>>>>  -> set IRQS_PENDING       |
>>>>                            |   -> evtchn_interrupt()
>>>>                            |   -> clear IRQD_IN_PROGRESS
>>>>                            |  -> IRQS_PENDING is set
>>>>                            |  -> handle_irq_event()
>>>>                            |   -> evtchn_interrupt()
>>>>                            |     -> WARN()
>>>>                            |
>>>>
>>>> All the lateeoi handlers expect ONESHOT semantics and
>>>> evtchn_interrupt() doesn't tolerate any deviation.
>>>>
>>>> I think the problem was introduced by 7f874a0447a9 ("xen/events: fix 
>>>> lateeoi irq acknowledgment") because the interrupt was disabled 
>>>> previously. Therefore we wouldn't do another iteration in 
>>>> handle_edge_irq().
>>>
>>> I think you picked the wrong commit for blaming, as this is just
>>> the last patch of the three patches you were testing.
>>
>> I actually found the right commit for blaming but I copied the 
>> information from the wrong shell :/. The bug was introduced by:
>>
>> c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
>>
>>>
>>>> Aside from the handlers, I think it may impact the deferred EOI
>>>> mitigation, because in theory a 3rd vCPU could join the party (let's
>>>> say vCPU A migrates the event from vCPU B to vCPU C). So
>>>> info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
>>>>
>>>> For a fix, we may want to consider holding evtchn_rwlock with write
>>>> permission. Although, I am not 100% sure this is going to
>>>> prevent everything.
>>>
>>> It will make things worse, as it would violate the locking hierarchy
>>> (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
>>
>> Ah, right.
>>
>>>
>>> On a first glance I think we'll need a 3rd masking state ("temporarily
>>> masked") in the second patch in order to avoid a race with lateeoi.
>>>
>>> In order to avoid the race you outlined above we need an "event is being
>>> handled" indicator checked via test_and_set() semantics in
>>> handle_irq_for_port() and reset only when calling clear_evtchn().
>>
>> It feels like we are trying to work around the IRQ flow we are using
>> (i.e. handle_edge_irq()).
> 
> I'm not really sure this is the main problem here. According to your
> analysis the main problem is occurring when handling the event, not when
> handling the IRQ: the event is being received on two vcpus.

I don't think we can easily divide the two because we rely on the IRQ 
framework to handle the lifecycle of the event. So...

> 
> Our problem isn't due to the IRQ still being pending, but due to it being
> raised again, which should happen the same way for a one-shot IRQ.

... I don't really see how the difference matters here. The idea is to
re-use what already exists rather than trying to re-invent the wheel
with an extra lock (or whatever else we can come up with).

Cheers,
Jürgen Groß Feb. 8, 2021, 10:22 a.m. UTC | #2
On 08.02.21 10:54, Julien Grall wrote:
> 
> 
> On 08/02/2021 09:41, Jürgen Groß wrote:
>> On 08.02.21 10:11, Julien Grall wrote:
>>> Hi Juergen,
>>>
>>> On 07/02/2021 12:58, Jürgen Groß wrote:
>>>> On 06.02.21 19:46, Julien Grall wrote:
>>>>> Hi Juergen,
>>>>>
>>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>>> The first three patches are fixes for XSA-332. They avoid WARN splats
>>>>>> and a performance issue with interdomain events.
>>>>>
>>>>> Thanks for helping to figure out the problem. Unfortunately, I
>>>>> still reliably see the WARN splat with the latest Linux master
>>>>> (1e0d27fce010) + your first 3 patches.
>>>>>
>>>>> I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L 
>>>>> events ABI.
>>>>>
>>>>> After some debugging, I think I have an idea of what went wrong. The
>>>>> problem happens when the event is initially bound from vCPU0 to a 
>>>>> different vCPU.
>>>>>
>>>>>  From the comment in xen_rebind_evtchn_to_cpu(), we are masking the 
>>>>> event to prevent it being delivered on an unexpected vCPU. However, 
>>>>> I believe the following can happen:
>>>>>
>>>>> vCPU0                      | vCPU1
>>>>>                            |
>>>>>                            | Call xen_rebind_evtchn_to_cpu()
>>>>> receive event X            |
>>>>>                            | mask event X
>>>>>                            | bind to vCPU1
>>>>> <vCPU descheduled>         | unmask event X
>>>>>                            |
>>>>>                            | receive event X
>>>>>                            |
>>>>>                            | handle_edge_irq(X)
>>>>> handle_edge_irq(X)         |  -> handle_irq_event()
>>>>>                            |   -> set IRQD_IN_PROGRESS
>>>>>  -> set IRQS_PENDING       |
>>>>>                            |   -> evtchn_interrupt()
>>>>>                            |   -> clear IRQD_IN_PROGRESS
>>>>>                            |  -> IRQS_PENDING is set
>>>>>                            |  -> handle_irq_event()
>>>>>                            |   -> evtchn_interrupt()
>>>>>                            |     -> WARN()
>>>>>                            |
>>>>>
>>>>> All the lateeoi handlers expect ONESHOT semantics and
>>>>> evtchn_interrupt() doesn't tolerate any deviation.
>>>>>
>>>>> I think the problem was introduced by 7f874a0447a9 ("xen/events: 
>>>>> fix lateeoi irq acknowledgment") because the interrupt was disabled 
>>>>> previously. Therefore we wouldn't do another iteration in 
>>>>> handle_edge_irq().
>>>>
>>>> I think you picked the wrong commit for blaming, as this is just
>>>> the last patch of the three patches you were testing.
>>>
>>> I actually found the right commit for blaming but I copied the 
>>> information from the wrong shell :/. The bug was introduced by:
>>>
>>> c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
>>>
>>>>
>>>>> Aside from the handlers, I think it may impact the deferred EOI
>>>>> mitigation, because in theory a 3rd vCPU could join the party (let's
>>>>> say vCPU A migrates the event from vCPU B to vCPU C). So
>>>>> info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
>>>>>
>>>>> For a fix, we may want to consider holding evtchn_rwlock with write
>>>>> permission. Although, I am not 100% sure this is going to
>>>>> prevent everything.
>>>>
>>>> It will make things worse, as it would violate the locking hierarchy
>>>> (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
>>>
>>> Ah, right.
>>>
>>>>
>>>> On a first glance I think we'll need a 3rd masking state ("temporarily
>>>> masked") in the second patch in order to avoid a race with lateeoi.
>>>>
>>>> In order to avoid the race you outlined above we need an "event is 
>>>> being
>>>> handled" indicator checked via test_and_set() semantics in
>>>> handle_irq_for_port() and reset only when calling clear_evtchn().
>>>
>>> It feels like we are trying to work around the IRQ flow we are using
>>> (i.e. handle_edge_irq()).
>>
>> I'm not really sure this is the main problem here. According to your
>> analysis the main problem is occurring when handling the event, not when
>> handling the IRQ: the event is being received on two vcpus.
> 
> I don't think we can easily divide the two because we rely on the IRQ 
> framework to handle the lifecycle of the event. So...
> 
>>
>> Our problem isn't due to the IRQ still being pending, but due to it being
>> raised again, which should happen the same way for a one-shot IRQ.
> 
> ... I don't really see how the difference matters here. The idea is to
> re-use what already exists rather than trying to re-invent the wheel
> with an extra lock (or whatever else we can come up with).

The difference is that the race is occurring _before_ any IRQ is
involved. So I don't see how modification of IRQ handling would help.


Juergen
Julien Grall Feb. 8, 2021, 10:40 a.m. UTC | #3
Hi Juergen,

On 08/02/2021 10:22, Jürgen Groß wrote:
> On 08.02.21 10:54, Julien Grall wrote:
>> ... I don't really see how the difference matters here. The idea is to
>> re-use what already exists rather than trying to re-invent the
>> wheel with an extra lock (or whatever else we can come up with).
> 
> The difference is that the race is occurring _before_ any IRQ is
> involved. So I don't see how modification of IRQ handling would help.

Roughly, our current IRQ handling flow (handle_edge_irq()) looks like:

if ( irq in progress )
{
   set IRQS_PENDING
   return;
}

do
{
   clear IRQS_PENDING
   handle_irq()
} while (IRQS_PENDING is set)

IRQ handling flow like handle_fasteoi_irq() looks like:

if ( irq in progress )
   return;

handle_irq()

The latter flow would catch "spurious" interrupts and ignore them. So it
would nicely handle the race when changing the event affinity.
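
To make the contrast concrete, here is a rough C rendering of the two
flows above. This is a simplified sketch only, not the kernel's actual
handle_edge_irq()/handle_fasteoi_irq() code; all demo_* names are
invented for illustration:

#include <stdbool.h>

struct demo_desc {
	bool in_progress;	/* stands in for IRQD_IRQ_INPROGRESS */
	bool pending;		/* stands in for IRQS_PENDING        */
};

/* Edge-style flow: a racing delivery is latched and replayed later. */
static void demo_edge_flow(struct demo_desc *desc, void (*handler)(void))
{
	if (desc->in_progress) {
		desc->pending = true;	/* second CPU: remember it and leave */
		return;
	}
	desc->in_progress = true;
	do {
		desc->pending = false;
		handler();	/* may run a second time -> WARN() for ONESHOT-style users */
	} while (desc->pending);
	desc->in_progress = false;
}

/* Fasteoi-style flow: a racing delivery is simply ignored. */
static void demo_fasteoi_flow(struct demo_desc *desc, void (*handler)(void))
{
	if (desc->in_progress)
		return;		/* "spurious" concurrent delivery is dropped */
	desc->in_progress = true;
	handler();
	desc->in_progress = false;
}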

Cheers,
Jürgen Groß Feb. 8, 2021, 12:14 p.m. UTC | #4
On 08.02.21 11:40, Julien Grall wrote:
> Hi Juergen,
> 
> On 08/02/2021 10:22, Jürgen Groß wrote:
>> On 08.02.21 10:54, Julien Grall wrote:
>>> ... I don't really see how the difference matters here. The idea is to
>>> re-use what already exists rather than trying to re-invent the
>>> wheel with an extra lock (or whatever else we can come up with).
>>
>> The difference is that the race is occurring _before_ any IRQ is
>> involved. So I don't see how modification of IRQ handling would help.
> 
> Roughly, our current IRQ handling flow (handle_edge_irq()) looks like:
> 
> if ( irq in progress )
> {
>    set IRQS_PENDING
>    return;
> }
> 
> do
> {
>    clear IRQS_PENDING
>    handle_irq()
> } while (IRQS_PENDING is set)
> 
> IRQ handling flow like handle_fasteoi_irq() looks like:
> 
> if ( irq in progress )
>    return;
> 
> handle_irq()
> 
> The latter flow would catch "spurious" interrupts and ignore them. So it
> would nicely handle the race when changing the event affinity.

Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
issued, thus having the same problem again? And I think we want to keep
the lateeoi behavior in order to be able to control event storms.


Juergen
Jürgen Groß Feb. 8, 2021, 1:58 p.m. UTC | #5
On 08.02.21 14:09, Julien Grall wrote:
> Hi Juergen,
> 
> On 08/02/2021 12:31, Jürgen Groß wrote:
>> On 08.02.21 13:16, Julien Grall wrote:
>>>
>>>
>>> On 08/02/2021 12:14, Jürgen Groß wrote:
>>>> On 08.02.21 11:40, Julien Grall wrote:
>>>>> Hi Juergen,
>>>>>
>>>>> On 08/02/2021 10:22, Jürgen Groß wrote:
>>>>>> On 08.02.21 10:54, Julien Grall wrote:
>>>>>>> ... I don't really see how the difference matters here. The idea
>>>>>>> is to re-use what already exists rather than trying to
>>>>>>> re-invent the wheel with an extra lock (or whatever else we can come up with).
>>>>>>
>>>>>> The difference is that the race is occurring _before_ any IRQ is
>>>>>> involved. So I don't see how modification of IRQ handling would help.
>>>>>
>>>>> Roughly, our current IRQ handling flow (handle_edge_irq()) looks like:
>>>>>
>>>>> if ( irq in progress )
>>>>> {
>>>>>    set IRQS_PENDING
>>>>>    return;
>>>>> }
>>>>>
>>>>> do
>>>>> {
>>>>>    clear IRQS_PENDING
>>>>>    handle_irq()
>>>>> } while (IRQS_PENDING is set)
>>>>>
>>>>> IRQ handling flow like handle_fasteoi_irq() looks like:
>>>>>
>>>>> if ( irq in progress )
>>>>>    return;
>>>>>
>>>>> handle_irq()
>>>>>
>>>>> The latter flow would catch "spurious" interrupts and ignore them.
>>>>> So it would nicely handle the race when changing the event affinity.
>>>>
>>>> Sure? Isn't "irq in progress" being reset way before our "lateeoi" is
>>>> issued, thus having the same problem again? 
>>>
>>> Sorry I can't parse this.
>>
>> handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is
>> this condition being reset again in order to be able to process another
>> IRQ?
> It is reset after the handler has been called. See handle_irq_event().

Right. And for us this is too early, as we want the next IRQ to be handled
only after we have called xen_irq_lateeoi().
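
For reference, the contract being described looks roughly like the
sketch below. It is illustrative only (the demo_* names are invented),
but xen_irq_lateeoi() and XEN_EOI_FLAG_SPURIOUS are the real interfaces
from include/xen/events.h:

#include <linux/interrupt.h>
#include <xen/events.h>

static irqreturn_t demo_evtchn_handler(int irq, void *dev_id)
{
	/* Note the event and wake up whoever consumes it ... */
	return IRQ_HANDLED;	/* no EOI yet: the channel stays "in flight" */
}

static void demo_request_done(unsigned int irq, bool was_spurious)
{
	/*
	 * Deferred EOI: only now may the next event be delivered. Passing
	 * XEN_EOI_FLAG_SPURIOUS lets the core delay delivery and thus
	 * throttle event storms.
	 */
	xen_irq_lateeoi(irq, was_spurious ? XEN_EOI_FLAG_SPURIOUS : 0);
}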

> 
>> I believe this will be the case before our "lateeoi" handling is
>> becoming active (more precisely: when our IRQ handler is returning to
>> handle_fasteoi_irq()), resulting in the possibility of the same race we
>> are experiencing now.
> 
> I am a bit confused about what you mean by "lateeoi" handling becoming
> active. Can you clarify?

See above: the next call of the handler should be allowed only after
xen_irq_lateeoi() for the IRQ has been called.

If the handler is being called earlier we have the race resulting
in the WARN() splats.

> Note that there are other existing IRQ flows. We should have a look at
> them before trying to fix things ourselves.

Fine with me, but it either needs to fit all use cases (interdomain,
IPI, real interrupts) or we need to have a per-type IRQ flow.

I think we should fix the issue locally first; then we can start planning
a thorough rework. It's not as if the changes needed with the current
flow would be huge, and I'd really like to have a solution sooner rather
than later. Changing the IRQ flow might have other side effects which
would need to be ruled out by thorough testing.

> Although, the other issue I can see so far is handle_irq_for_port() will 
> update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it 
> is not clear this is what you mean by "becoming active".

As long as a single event can't be handled on multiple cpus at the same
time, there is no locking needed.
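
As a reference point, a minimal sketch of the "event is being handled"
indicator mentioned earlier in the thread could look like the following.
This is purely illustrative: the demo_* names and the flag bit are
invented, and the real handle_irq_for_port()/clear_evtchn() in
drivers/xen/events/ look different:

#include <linux/bitops.h>
#include <linux/irq.h>

/* Invented per-event state for this sketch; not the real struct irq_info. */
struct demo_evt_info {
	unsigned long flags;
	unsigned int irq;
};
#define DEMO_EVT_IN_FLIGHT 0	/* hypothetical "event is being handled" bit */

/* Dispatch a port; a delivery racing with one already in flight is dropped. */
static void demo_handle_irq_for_port(struct demo_evt_info *info)
{
	if (test_and_set_bit(DEMO_EVT_IN_FLIGHT, &info->flags))
		return;		/* already being handled on another CPU */

	generic_handle_irq(info->irq);
}

/* Counterpart of clear_evtchn(): only now may the next delivery be handled. */
static void demo_clear_evtchn(struct demo_evt_info *info)
{
	/* ... clear the pending bit in the shared info page here ... */
	clear_bit(DEMO_EVT_IN_FLIGHT, &info->flags);
}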


Juergen
Julien Grall Feb. 8, 2021, 2:35 p.m. UTC | #6
On 08/02/2021 14:20, Julien Grall wrote:
>>>> I believe this will be the case before our "lateeoi" handling is
>>>> becoming active (more precisely: when our IRQ handler is returning to
>>>> handle_fasteoi_irq()), resulting in the possibility of the same race we
>>>> are experiencing now.
>>>
>>> I am a bit confused about what you mean by "lateeoi" handling becoming
>>> active. Can you clarify?
>>
>> See above: the next call of the handler should be allowed only after
>> xen_irq_lateeoi() for the IRQ has been called.
>>
>> If the handler is being called earlier we have the race resulting
>> in the WARN() splats.
> 
> I feel it is dislike to understand race with just words. Can you provide

Sorry I meant difficult rather than dislike.

Cheers,