linux-generic: sched fix polling input queue

Message ID 20150401152956.GA16425@localhost
State New
Headers show

Commit Message

Stuart Haslam April 1, 2015, 3:29 p.m.
On Wed, Apr 01, 2015 at 05:47:51PM +0300, Maxim Uvarov wrote:
> On 04/01/15 16:57, Stuart Haslam wrote:
> >On Wed, Apr 01, 2015 at 04:30:59PM +0300, Maxim Uvarov wrote:
> >>Commit:
> >>  8fa64fb9fc4e43fbd9d3c226ed89229863bdb771
> >>  linux-generic: scheduler: restructured queue and pktio integration
> >>Implement race with schedule termination and polling input queue.
> >>This patch locks pktio while doing poll to prevent destroy linked
> >>queue in the middle of this poll.
> >>
> >Admittedly I've not looked in detail at the terminate sequence after the
> >scheduler changes, so don't really understand what you're fixing, but
> >this feels like a workaround rather than a fix. Shouldn't the pktin
> >queue have been removed from the scheduler before it's closed? What's
> >the sequence that leads to the problem?
> as I understand that right one thread goes to
> schedule()->pktin_poll(sched_cmd->pe))
> then successful do:
> odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
> after that other thread calls terminate and bellow in the pktio_poll code:
> queue_enq_multi(qentry, hdr_tbl, num_enq);
> qentry here is corrupted due to it has been destroyed by other thread.

I see from the trace that you can reproduce this with the odp_pktio.c
validation test, which is single threaded so there's no thread race.

> Because of qentry is linked to pktio entry we have to lock pktio
> entry for that
> queue to make sure that is was not modified while pktin_poll execution.
> I did make check from the root to find that problem (it occurs about
> 1 of 10 run times).
> I sent back trace some to mailing list some time ago:
> ore was generated by `./test/validation/odp_pktio'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> at ./include/odp/atomic.h:70
> 70        return __atomic_fetch_add(&atom->v, 1, __ATOMIC_RELAXED);
> (gdb) bt
> #0  0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> at ./include/odp/atomic.h:70
> #1  0x0000000000411e8a in odp_ticketlock_lock
> (ticketlock=0x2baaaadfff00) at odp_ticketlock.c:28
> #2  0x000000000040f0f8 in queue_enq_multi (queue=0x2baaaadfff00,
> buf_hdr=0x7fff1fccb0b0, num=1) at odp_queue.c:376
> #3  0x000000000040987d in pktin_poll (entry=0x2aaaab200600) at
> odp_packet_io.c:713
> #4  0x0000000000410378 in schedule (out_queue=0x0,
> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:455
> #5  0x000000000041050a in schedule_loop (out_queue=0x0, wait=1,
> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:518
> #6  0x00000000004105a4 in odp_schedule (out_queue=0x0, wait=1) at
> odp_schedule.c:551
> #7  0x0000000000402b83 in destroy_inq (pktio=0x2) at odp_pktio.c:320

So the problem sequence is this;

 1. destroy_inq() calls odp_pktio_inq_remdef()
 2. odp_pktio_inq_remdef() sets the pktio's inq_default to
    ODP_QUEUE_INVALID and does nothing to remove the queue from
    scheduler (not sure if it should?)
 3. destroy_inq() calls odp_schedule()
 4. odp_schedule() dequeues the event to poll the pktin queue and then
    calls pktin_poll()
 5. pktin_poll() attempts to fetch some packets from the pktin, and if
    it receives any, attempts to enqueue them to using inq_default,
    which by this point is ODP_QUEUE_INVALID.

So there's a fundamental breakage. I'm not sure yet how it should be
fixed but this patch will make it go away;

The race is that you'll only get the crash if some packets were actually
 received between the last time the validation test itself called
odp_schedule() and calling destroy_inq(). You'd never see it on the
linux-generic "loop" interface as traffic doesn't just mysteriously
appear there.


diff --git a/platform/linux-generic/odp_packet_io.c b/platform/linux-generic/odp_packet_io.c
index 15335d1..ae1de3c 100644
--- a/platform/linux-generic/odp_packet_io.c
+++ b/platform/linux-generic/odp_packet_io.c
@@ -685,6 +685,9 @@  int pktin_poll(pktio_entry_t *entry)
        if (odp_unlikely(is_free(entry)))
                return -1;
+       if (entry->s.inq_default == ODP_QUEUE_INVALID)
+               return -1;
        num = odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
        if (num < 0) {