diff mbox

linux-generic: schedule: simplify wait logic to avoid clang issues

Message ID 1467381354-13151-1-git-send-email-bill.fischofer@linaro.org
State Superseded
Headers show

Commit Message

Bill Fischofer July 1, 2016, 1:55 p.m. UTC
Resolve Bug https://bugs.linaro.org/show_bug.cgi?id=2387 by ensuring that
t1 variable is see as initialized in all paths in the SP Scheduler.
Previous logic failed due to an apparent issue with clang.

Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
---
 platform/linux-generic/odp_schedule_sp.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

Comments

Bill Fischofer July 1, 2016, 3 p.m. UTC | #1
On Fri, Jul 1, 2016 at 9:50 AM, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolainen@nokia-bell-labs.com> wrote:

>
>
> This is exactly what we should not do - hide a bug in one compiler version
> with additional code. It results lower performance for all compiler targets
> and code that is unnecessary cryptic to maintain.
>

I don't think you've analyzed this properly. The comparison is there
regardless because of the odp_time_cmp() needed to tell if the wait has
expired--that's in the original code.


>
>
> > -----Original Message-----
> > From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org] On Behalf Of
> Bill
> > Fischofer
> > Sent: Friday, July 01, 2016 4:56 PM
> > To: lng-odp@lists.linaro.org
> > Subject: [lng-odp] [PATCH] linux-generic: schedule: simplify wait logic
> to
> > avoid clang issues
> >
> > Resolve Bug https://bugs.linaro.org/show_bug.cgi?id=2387 by ensuring
> that
> > t1 variable is see as initialized in all paths in the SP Scheduler.
> > Previous logic failed due to an apparent issue with clang.
> >
> > Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
> > ---
> >  platform/linux-generic/odp_schedule_sp.c | 15 ++++++---------
> >  1 file changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/platform/linux-generic/odp_schedule_sp.c b/platform/linux-
> > generic/odp_schedule_sp.c
> > index 8c45123..e373d7f 100644
> > --- a/platform/linux-generic/odp_schedule_sp.c
> > +++ b/platform/linux-generic/odp_schedule_sp.c
> > @@ -364,10 +364,9 @@ static uint64_t schedule_wait_time(uint64_t ns)
> >  }
> >
> >  static int schedule_multi(odp_queue_t *from, uint64_t wait,
> > -                       odp_event_t events[], int max_events)
> > +                       odp_event_t events[], int max_events ODP_UNUSED)
> >  {
> > -     (void)max_events;
> > -     int update_t1 = 1;
> > +     odp_time_t t1 = ODP_TIME_NULL;
>
> In 32 bit system, this translates to ...
> t1[0] = 0;
> t1[1] = 0;
> t1[2] = 0;
> t1[3] = 0;
>
> ... instead of single
>
> update_t1 = 1;
>
>
> >
> >       if (sched_local.cmd) {
> >               /* Continue scheduling if queue is not empty */
> > @@ -384,7 +383,6 @@ static int schedule_multi(odp_queue_t *from, uint64_t
> > wait,
> >               sched_cmd_t *cmd;
> >               uint32_t qi;
> >               int num;
> > -             odp_time_t t1;
> >
> >               cmd = sched_cmd(NUM_PRIO);
> >
> > @@ -410,16 +408,15 @@ static int schedule_multi(odp_queue_t *from,
> > uint64_t wait,
> >                       if (wait == ODP_SCHED_WAIT)
> >                               continue;
> >
> > -                     if (update_t1) {
> > +                     if (odp_time_cmp(odp_time_local(), t1) < 0)
> > +                             continue;


> This generates many additional cycles, since the compare expands over 4
> words in 32 bit systems...
>
> if (t1.tv_sec.upper  == t2.tv_sec.upper && t1.tv_sec.lower ==
> t2.tv_sec.lower &&
>     t1.tv_nsec.upper == t2.tv_nsec.upper && t1.tv_nsec.lower ==
> t2.tv_nsec.lower)
>
>
> The #ifdef pays off this time. It's simpler, faster and explicit about the
> clang bug.
>

No it does not. The odp_time_cmp() is there in the original code. What's
more efficient now is that for every iteration of the loop you no longer
have the test of update_t1, so this is a net efficiency gain. Now we do the
odp_time_tmp(t1, ODP_TIME_NULL) exactly twice (the first and last iteration
of the loop), and only if we are doing a timed schedule call.

For WAIT and NO_WAIT calls the delta is the init of t1 to ODP_TIME_NULL vs.
update_t1 to 1, however for timed waits the payoff is significant since
presumably the loop will iterate many times during a timed wait before
timing out.


>
>
> -Petri
>
>
>
> > +
> > +                     if (odp_time_cmp(t1, ODP_TIME_NULL) == 0) {
> >                               t1 = odp_time_sum(odp_time_local(),
> >
>  odp_time_local_from_ns(wait));
> > -                             update_t1 = 0;
> >                               continue;
> >                       }
> >
> > -                     if (odp_time_cmp(odp_time_local(), t1) < 0)
> > -                             continue;
> > -
> >                       return 0;
> >               }
> >
> > --
> > 2.7.4
> >
> > _______________________________________________
> > lng-odp mailing list
> > lng-odp@lists.linaro.org
> > https://lists.linaro.org/mailman/listinfo/lng-odp
>
Bill Fischofer July 5, 2016, 2:34 a.m. UTC | #2
On Mon, Jul 4, 2016 at 11:35 AM, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolainen@nokia-bell-labs.com> wrote:

>
>
> From: Bill Fischofer [mailto:bill.fischofer@linaro.org]
> Sent: Friday, July 01, 2016 6:01 PM
> To: Savolainen, Petri (Nokia - FI/Espoo) <
> petri.savolainen@nokia-bell-labs.com>
> Cc: lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] [PATCH] linux-generic: schedule: simplify wait
> logic to avoid clang issues
>
>
>
> On Fri, Jul 1, 2016 at 9:50 AM, Savolainen, Petri (Nokia - FI/Espoo) <
> petri.savolainen@nokia-bell-labs.com> wrote:
>
>
> This is exactly what we should not do - hide a bug in one compiler version
> with additional code. It results lower performance for all compiler targets
> and code that is unnecessary cryptic to maintain.
>
> I don't think you've analyzed this properly. The comparison is there
> regardless because of the odp_time_cmp() needed to tell if the wait has
> expired--that's in the original code.
>
>
>
> > -----Original Message-----
> > From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org] On Behalf Of
> Bill
> > Fischofer
> > Sent: Friday, July 01, 2016 4:56 PM
> > To: lng-odp@lists.linaro.org
> > Subject: [lng-odp] [PATCH] linux-generic: schedule: simplify wait logic
> to
> > avoid clang issues
> >
> > Resolve Bug https://bugs.linaro.org/show_bug.cgi?id=2387 by ensuring
> that
> > t1 variable is see as initialized in all paths in the SP Scheduler.
> > Previous logic failed due to an apparent issue with clang.
> >
> > Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
> > ---
> >  platform/linux-generic/odp_schedule_sp.c | 15 ++++++---------
> >  1 file changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/platform/linux-generic/odp_schedule_sp.c b/platform/linux-
> > generic/odp_schedule_sp.c
> > index 8c45123..e373d7f 100644
> > --- a/platform/linux-generic/odp_schedule_sp.c
> > +++ b/platform/linux-generic/odp_schedule_sp.c
> > @@ -364,10 +364,9 @@ static uint64_t schedule_wait_time(uint64_t ns)
> >  }
> >
> >  static int schedule_multi(odp_queue_t *from, uint64_t wait,
> > -                       odp_event_t events[], int max_events)
> > +                       odp_event_t events[], int max_events ODP_UNUSED)
> >  {
> > -     (void)max_events;
> > -     int update_t1 = 1;
> > +     odp_time_t t1 = ODP_TIME_NULL;
> In 32 bit system, this translates to ...
> t1[0] = 0;
> t1[1] = 0;
> t1[2] = 0;
> t1[3] = 0;
>
> ... instead of single
>
> update_t1 = 1;
>
>
> >
> >       if (sched_local.cmd) {
> >               /* Continue scheduling if queue is not empty */
> > @@ -384,7 +383,6 @@ static int schedule_multi(odp_queue_t *from, uint64_t
> > wait,
> >               sched_cmd_t *cmd;
> >               uint32_t qi;
> >               int num;
> > -             odp_time_t t1;
> >
> >               cmd = sched_cmd(NUM_PRIO);
> >
> > @@ -410,16 +408,15 @@ static int schedule_multi(odp_queue_t *from,
> > uint64_t wait,
> >                       if (wait == ODP_SCHED_WAIT)
> >                               continue;
> >
> > -                     if (update_t1) {
> > +                     if (odp_time_cmp(odp_time_local(), t1) < 0)
> > +                             continue;
>
> This generates many additional cycles, since the compare expands over 4
> words in 32 bit systems...
>
> if (t1.tv_sec.upper  == t2.tv_sec.upper && t1.tv_sec.lower ==
> t2.tv_sec.lower &&
>     t1.tv_nsec.upper == t2.tv_nsec.upper && t1.tv_nsec.lower ==
> t2.tv_nsec.lower)
>
>
> The #ifdef pays off this time. It's simpler, faster and explicit about the
> clang bug.
>
> No it does not. The odp_time_cmp() is there in the original code. What's
> more efficient now is that for every iteration of the loop you no longer
> have the test of update_t1, so this is a net efficiency gain. Now we do the
> odp_time_tmp(t1, ODP_TIME_NULL) exactly twice (the first and last iteration
> of the loop), and only if we are doing a timed schedule call.
>
> For WAIT and NO_WAIT calls the delta is the init of t1 to ODP_TIME_NULL
> vs. update_t1 to 1, however for timed waits the payoff is significant since
> presumably the loop will iterate many times during a timed wait before
> timing out.
>
>
>
> << reply to HTML mail >>
>
> if (update_t1) {
>    ...
> }
>
> if (odp_time_cmp(odp_time_local(), t1) < 0)
>     continue;
>
> This condition is very easy for compiler and CPU branch predictor HW to
> predict correctly (update_t1 is 0 in practice always). Today many CPUs are
> multi-issue machines that may execute both true and false branches in
> parallel and then continue the true branch while the false branch will be
> dropped when evaluation of the condition has completed (here 1 cycle
> later). So, if you measure the performance you may not even see the
> (single) cycle which goes to a static if condition evaluation like this.
>

Whether or not it's easy to predict it's easier still to not execute this
test at all.


>
>
> if (odp_time_cmp(odp_time_local(), t1) < 0)
>      continue;
>
> if (odp_time_cmp(t1, ODP_TIME_NULL) == 0) {
>      t1 = odp_time_sum(odp_time_local(),
>                        odp_time_local_from_ns(wait));
>      continue;
> }
>
> This is harder to predict (at least in compile time), since outcome of
> time_local() is not static (contains a system call, etc).
>

There's nothing to predict here because the odp_time_cmp() call will
execute and this test will succeed in almost all cases for a timed wait. It
will fail (requiring the following test to be executed) exactly twice: on
the first iteration of the loop when t1 is initialized and on the final
iteration in the event the timed wait times out.  Since this comparison is
a series of && tests, if it fails it will most likely fail on the first
such comparison so there's no need to compare four words on a 32-bit
machine.


>
> So, this would not likely be faster in the timeout case. It would be
> likely slower in the common case (WAIT / NO_WAIT) due to more data
> accesses. The logic is harder to read/maintain, and it differs from the
> default scheduler where "update_t1" is called "first" (and clang is happy).
>

In the spirit of "de gustibus non est disputandem" I've posted a v2 of the
patch which simply moves the declaration of t1 out of the inner while loop
and this seems to satisfy clang so there's no change to the original logic
and this doesn't need #ifdefs.


>
> -Petri
>
>
diff mbox

Patch

diff --git a/platform/linux-generic/odp_schedule_sp.c b/platform/linux-generic/odp_schedule_sp.c
index 8c45123..e373d7f 100644
--- a/platform/linux-generic/odp_schedule_sp.c
+++ b/platform/linux-generic/odp_schedule_sp.c
@@ -364,10 +364,9 @@  static uint64_t schedule_wait_time(uint64_t ns)
 }
 
 static int schedule_multi(odp_queue_t *from, uint64_t wait,
-			  odp_event_t events[], int max_events)
+			  odp_event_t events[], int max_events ODP_UNUSED)
 {
-	(void)max_events;
-	int update_t1 = 1;
+	odp_time_t t1 = ODP_TIME_NULL;
 
 	if (sched_local.cmd) {
 		/* Continue scheduling if queue is not empty */
@@ -384,7 +383,6 @@  static int schedule_multi(odp_queue_t *from, uint64_t wait,
 		sched_cmd_t *cmd;
 		uint32_t qi;
 		int num;
-		odp_time_t t1;
 
 		cmd = sched_cmd(NUM_PRIO);
 
@@ -410,16 +408,15 @@  static int schedule_multi(odp_queue_t *from, uint64_t wait,
 			if (wait == ODP_SCHED_WAIT)
 				continue;
 
-			if (update_t1) {
+			if (odp_time_cmp(odp_time_local(), t1) < 0)
+				continue;
+
+			if (odp_time_cmp(t1, ODP_TIME_NULL) == 0) {
 				t1 = odp_time_sum(odp_time_local(),
 						  odp_time_local_from_ns(wait));
-				update_t1 = 0;
 				continue;
 			}
 
-			if (odp_time_cmp(odp_time_local(), t1) < 0)
-				continue;
-
 			return 0;
 		}