
test:linux-generic: run odp_scheduling in process mode

Message ID CAKb83kafPTvpzvsi=jmbTyUUmmP_4UGVpztev=9sM=hUna1kMQ@mail.gmail.com
State New

Commit Message

Bill Fischofer Aug. 4, 2016, 4:41 p.m. UTC
On Thu, Aug 4, 2016 at 11:17 AM, Brian Brooks <brian.brooks@linaro.org>
wrote:

> On 08/04 11:01:09, Bill Fischofer wrote:
> > On Thu, Aug 4, 2016 at 10:59 AM, Mike Holmes <mike.holmes@linaro.org>
> > wrote:
> > >
> > > On 4 August 2016 at 11:47, Bill Fischofer <bill.fischofer@linaro.org>
> > > wrote:
> > >
> > >> On Thu, Aug 4, 2016 at 10:36 AM, Mike Holmes <mike.holmes@linaro.org>
> > >> wrote:
> > >>
> > >>> On my vanilla x86 I don't get any issues. I'm keen to get this in
> > >>> and have CI run it on lots of HW to see what happens; many of the
> > >>> other tests completely fail in process mode, so I think we will
> > >>> expose a lot as we add them.
> > >>>
> > >>> On 4 August 2016 at 11:33, Bill Fischofer <bill.fischofer@linaro.org>
> > >>> wrote:
> > >>>
> > >>>> On Thu, Aug 4, 2016 at 10:26 AM, Brian Brooks <brian.brooks@linaro.org>
> > >>>> wrote:
> > >>>>
> > >>>>> Reviewed-by: Brian Brooks <brian.brooks@linaro.org>
> > >>>>>
> > >>>>> On 08/04 09:18:14, Mike Holmes wrote:
> > >>>>> > +ret=0
> > >>>>> > +
> > >>>>> > +run()
> > >>>>> > +{
> > >>>>> > +     echo odp_scheduling_run_proc starts with $1 worker threads
> > >>>>> > +     echo =====================================================
> > >>>>> > +
> > >>>>> > +     $PERFORMANCE/odp_scheduling${EXEEXT} --odph_proc -c $1 || ret=1
> > >>>>> > +}
> > >>>>> > +
> > >>>>> > +run 1
> > >>>>> > +run 8
> > >>>>> > +
> > >>>>> > +exit $ret
> > >>>>>
> > >>>>> Seeing this randomly in both multithread and multiprocess modes:
> > >>>>
> > >>>> Before or after you apply this patch? What environment are you
> > >>>> seeing these errors in? They should definitely not be happening.
> > >>>>
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:328:odp_queue_destroy():queue
> > >>>>> "sched_00_07" not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:271:schedule_term_global():Queue
> > >>>>> not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:294:schedule_term_global():Pool
> > >>>>> destroy fail.
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:188:_odp_term_global():ODP
> > >>>>> schedule term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:170:odp_queue_term_global():Not
> > >>>>> destroyed queue: sched_00_07
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:195:_odp_term_global():ODP
> > >>>>> queue term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not
> > >>>>> destroyed pool: odp_sched_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not
> > >>>>> destroyed pool: msg_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:202:_odp_term_global():ODP
> > >>>>> buffer pool term failed.
> > >>>>> ~/odp_incoming/odp_build/test/common_plat/performance$ echo $?
> > >>>>> 0
> > >>
> > >> Looks like we have a real issue that somehow crept into master. I can
> > >> sporadically reproduce these same errors on my x86 system. It looks
> > >> like this is also present in the monarch_lts branch.
> > >
> > > I think we agreed that Monarch would not support process mode because
> > > we never tested for it, but for TgrM we need to start fixing it.
> >
> > Unfortunately, the issue Brian identified has nothing to do with process
> > mode. It happens in regular pthread mode on all levels past v1.10.0.0 as
> > far as I can see.
>
> The issue seems to emerge only under high event rates. The application
> asks for more work, but none is scheduled even though there actually is
> work in the queue, so the teardown fails because the queue is not empty.
> There may be a disconnect between scheduling and queueing, or some other
> synchronization-related bug. I think I've seen something similar on an
> ARM platform, so it may be architecture independent.
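If that diagnosis is right, a clean shutdown would have to drain whatever
the scheduler still holds before destroying its queues. As a sketch only,
using the public ODP scheduling API (drain_scheduler is a hypothetical
helper, not part of any patch in this thread, and the 100 ms bound is an
arbitrary choice for illustration):

    #include <odp_api.h>

    /* Illustrative sketch: pull any events the scheduler is still
     * holding and free them before the scheduled queues are destroyed.
     * Gives up once no event arrives within the wait bound. */
    static void drain_scheduler(void)
    {
            odp_event_t ev;
            uint64_t wait = odp_schedule_wait_time(100 * ODP_TIME_MSEC_IN_NS);

            while ((ev = odp_schedule(NULL, wait)) != ODP_EVENT_INVALID)
                    odp_event_free(ev);
    }

If a drain like this makes the teardown errors disappear, that would
support the theory that events are stranded in the scheduler rather than
lost, though it would only mask the underlying scheduling bug.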


Well, now that I'm trying to find the root issue, it's proving elusive. I
was able to get this failure in < 10 runs before, but now it doesn't want
to show itself. If you can repro it more readily, can you get a core dump
of the failure? I've been running with this patch:

---
From ba1fa0eb943fa7a3a3c9202b9e5bf5fc2ed5d1f4 Mon Sep 17 00:00:00 2001
From: Bill Fischofer <bill.fischofer@linaro.org>
Date: Thu, 4 Aug 2016 11:38:01 -0500
Subject: [PATCH] debug: abort on cleanup errors

Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
---
 test/performance/odp_scheduling.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

-- 
2.7.4

Patch

diff --git a/test/performance/odp_scheduling.c b/test/performance/odp_scheduling.c
index c575b70..7505df9 100644
--- a/test/performance/odp_scheduling.c
+++ b/test/performance/odp_scheduling.c
@@ -785,6 +785,7 @@ int main(int argc, char *argv[])
 	char cpumaskstr[ODP_CPUMASK_STR_SIZE];
 	odp_pool_param_t params;
 	int ret = 0;
+	int rc = 0;
 	odp_instance_t instance;
 	odph_odpthread_params_t thr_params;
 
@@ -953,15 +954,17 @@ int main(int argc, char *argv[])
 
 		for (j = 0; j < QUEUES_PER_PRIO; j++) {
 			queue = globals->queue[i][j];
-			odp_queue_destroy(queue);
+			rc += odp_queue_destroy(queue);
 		}
 	}
 
-	odp_shm_free(shm);
-	odp_queue_destroy(plain_queue);
-	odp_pool_destroy(pool);
-	odp_term_local();
-	odp_term_global(instance);
+	rc += odp_shm_free(shm);
+	rc += odp_queue_destroy(plain_queue);
+	rc += odp_pool_destroy(pool);
+	rc += odp_term_local();
+	rc += odp_term_global(instance);
+	if (rc != 0)
+		abort();
 
 	return ret;
 }
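The abort() is purely a debug aid: the patch assumes every teardown call
contributes 0 to rc on success, so any nonzero sum raises SIGABRT and,
with core dumps enabled (e.g. ulimit -c unlimited), leaves a core at the
failure point. The same check, factored into a hypothetical helper (not
part of the patch above), for anyone reproducing this:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: fail loudly when any teardown stage reported
     * an error. rc is the accumulated sum of the teardown return codes,
     * as in the patch, so rc != 0 means at least one stage failed. */
    static void check_teardown(int rc)
    {
            if (rc != 0) {
                    fprintf(stderr, "cleanup failed (rc=%d), aborting\n", rc);
                    abort(); /* SIGABRT; dumps core if ulimit -c allows */
            }
    }

Loading the resulting core in gdb should then show the process state at
the failed teardown, including the still-populated queue structures.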