[1/2] Introduce prefetch-minimum stride option

Message ID 1516628770-25036-2-git-send-email-luis.machado@linaro.org
State New
Series
  • Add a couple new options to control loop prefetch pass

Commit Message

Luis Machado Jan. 22, 2018, 1:46 p.m.
This patch adds a new option that sets the minimum stride a memory reference
must have before the loop prefetch pass may issue software prefetch hints
for it. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger strides
that are more likely to benefit from prefetching. I've noticed a case in cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows the
loop prefetch pass to defer prefetching of smaller (less than the threshold)
strides to the hardware prefetcher instead. This prevents conflicts between
the software prefetcher and the hardware prefetcher.

I've noticed a considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets. Those
are free to tweak the settings as necessary.
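
The intended behavior, including the no-op default, can be sketched as the
small helper below. It is only an illustration: stride_allows_prefetch is a
hypothetical name, not a function in GCC, and it models constant strides
only. The comparison is kept signed on purpose. If the absolute stride were
an unsigned value, a threshold of -1 would be converted to a very large
unsigned number and every constant stride would be filtered out.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not GCC code.  STRIDE is a known constant stride in
   bytes; MIN_STRIDE is the threshold, with -1 meaning "no threshold".
   Returns true if the threshold does not forbid a prefetch hint.  */
static bool
stride_allows_prefetch (long stride, long min_stride)
{
  /* labs () keeps the value signed, so a MIN_STRIDE of -1 is never greater
     than it and the check is a no-op by default.  */
  return labs (stride) >= min_stride;
}
```

With a threshold of 2048 (the value chosen for qdf24xx), strides of 64 or
2047 bytes are left to the hardware prefetcher, while strides of 2048 or
-4096 bytes remain candidates for software hints.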

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?

2018-01-22  Luis Machado  <luis.machado@linaro.org>

	Introduce option to limit software prefetching to known constant
	strides above a specific threshold with the goal of preventing
	conflicts with a hardware prefetcher.

	gcc/
	* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
	<minimum_stride>: New const int field.
	* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
	minimum_stride field.
	(exynosm1_prefetch_tune): Likewise.
	(thunderxt88_prefetch_tune): Likewise.
	(thunderx_prefetch_tune): Likewise.
	(thunderx2t99_prefetch_tune): Likewise.
	(qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048.
	(aarch64_override_options_internal): Update to set
	PARAM_PREFETCH_MINIMUM_STRIDE.
	* doc/invoke.texi (prefetch-minimum-stride): Document new option.
	* params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New.
	* params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define.
	* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return false if
	stride is constant and is below the minimum stride threshold.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 gcc/config/aarch64/aarch64.c        | 13 ++++++++++++-
 gcc/doc/invoke.texi                 | 15 +++++++++++++++
 gcc/params.def                      |  9 +++++++++
 gcc/params.h                        |  2 ++
 gcc/tree-ssa-loop-prefetch.c        | 16 ++++++++++++++++
 6 files changed, 57 insertions(+), 1 deletion(-)

-- 
2.7.4

Comments

Kyrill Tkachov Jan. 23, 2018, 9:32 a.m. | #1
Hi Luis,

On 22/01/18 13:46, Luis Machado wrote:
> This patch adds a new option to control the minimum stride, for a memory
> reference, after which the loop prefetch pass may issue software prefetch
> hints for. There are two motivations:
>
> * Make the pass less aggressive, only issuing prefetch hints for bigger strides
> that are more likely to benefit from prefetching. I've noticed a case in cpu2017
> where we were issuing thousands of hints, for example.
>

I've noticed a large number of prefetch hints being issued as well, but had not
analysed it further.

> * For processors that have a hardware prefetcher, like Falkor, it allows the
> loop prefetch pass to defer prefetching of smaller (less than the threshold)
> strides to the hardware prefetcher instead. This prevents conflicts between
> the software prefetcher and the hardware prefetcher.
>
> I've noticed considerable reduction in the number of prefetch hints and
> slightly positive performance numbers. This aligns GCC and LLVM in terms of
> prefetch behavior for Falkor.

Do you, by any chance, have a link to the LLVM review that implemented that behavior?
It's okay if you don't, but I think it would be useful context.

>
> The default settings should guarantee no changes for existing targets. Those
> are free to tweak the settings as necessary.
>
> No regressions in the testsuite and bootstrapped ok on aarch64-linux.
>
> Ok?
>

Are there any benchmark numbers you can share?
I think this approach is sensible.

Since your patch touches generic code as well as AArch64
code you'll need an approval from a midend maintainer as well as an AArch64 maintainer.
Also, GCC development is now in the regression fixing stage, so unless this fixes a regression
it may have to wait until GCC 9 development is opened.

Thanks,
Kyrill

Luis Machado Jan. 23, 2018, 1:12 p.m. | #2
Hi Kyrill,

On 01/23/2018 07:32 AM, Kyrill Tkachov wrote:
> Hi Luis,
>
> On 22/01/18 13:46, Luis Machado wrote:
>> This patch adds a new option to control the minimum stride, for a memory
>> reference, after which the loop prefetch pass may issue software prefetch
>> hints for. There are two motivations:
>>
>> * Make the pass less aggressive, only issuing prefetch hints for bigger strides
>> that are more likely to benefit from prefetching. I've noticed a case in cpu2017
>> where we were issuing thousands of hints, for example.
>>
>
> I've noticed a large amount of prefetch hints being issued as well, but had not
> analysed it further.
>

I've gathered some numbers for this. Some of the most extreme cases 
before both patches:

CPU2017

xalancbmk_s: 3755 hints
wrf_s: 10950 hints
parest_r: 8521 hints

CPU2006

gamess: 11377 hints
wrf: 3238 hints

After both patches:

CPU2017

xalancbmk_s: 1 hint
wrf_s: 20 hints
parest_r: 0 hints

CPU2006

gamess: 44 hints
wrf: 16 hints
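
To put the counts above in relative terms, here is a small sketch that
computes the reduction per benchmark. The numbers are copied from the lists
above; the struct and function names are made up for this illustration.
Every listed benchmark ends up with a reduction of more than 99%.

```c
/* Hint counts copied from the lists above.  */
struct hint_count
{
  const char *benchmark;
  unsigned before;
  unsigned after;
};

static const struct hint_count hint_counts[] = {
  { "CPU2017 xalancbmk_s", 3755, 1 },
  { "CPU2017 wrf_s", 10950, 20 },
  { "CPU2017 parest_r", 8521, 0 },
  { "CPU2006 gamess", 11377, 44 },
  { "CPU2006 wrf", 3238, 16 },
};

/* Percentage of hints removed.  BEFORE must be non-zero and at least as
   large as AFTER.  */
static double
reduction_percent (unsigned before, unsigned after)
{
  return 100.0 * (double) (before - after) / (double) before;
}
```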


>> * For processors that have a hardware prefetcher, like Falkor, it allows the
>> loop prefetch pass to defer prefetching of smaller (less than the threshold)
>> strides to the hardware prefetcher instead. This prevents conflicts between
>> the software prefetcher and the hardware prefetcher.
>>
>> I've noticed considerable reduction in the number of prefetch hints and
>> slightly positive performance numbers. This aligns GCC and LLVM in terms of
>> prefetch behavior for Falkor.
>
> Do you, by any chance, have a link to the LLVM review that implemented that behavior?
> It's okay if you don't, but I think it would be useful context.
>

I've dug it up. The base change was implemented here:

review: https://reviews.llvm.org/D17945
RFC: http://lists.llvm.org/pipermail/llvm-dev/2015-December/093514.html

And then target-specific changes were introduced later for specific 
processors.

One small difference in LLVM is the fact that the second parameter, 
prefetching of non-constant strides, is implicitly switched off if one 
sets the minimum stride length. My approach here makes that second 
parameter adjustable.
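
The difference between the two designs can be sketched as follows. This is
a toy model under stated assumptions, not GCC or LLVM code: all names are
hypothetical, a min_stride of -1 stands for "not set", and the coupled
variant simply encodes the LLVM behavior as described in the paragraph
above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the two knobs discussed above.  */
struct prefetch_policy
{
  long min_stride;            /* -1 means no threshold.  */
  bool allow_dynamic_strides; /* Prefetch when the stride is not constant.  */
};

/* Coupled design: setting a minimum stride implicitly turns off
   prefetching of non-constant strides.  */
static struct prefetch_policy
coupled_policy (long min_stride)
{
  struct prefetch_policy p = { min_stride, min_stride < 0 };
  return p;
}

/* Independent design: both settings are chosen separately.  */
static struct prefetch_policy
independent_policy (long min_stride, bool allow_dynamic_strides)
{
  struct prefetch_policy p = { min_stride, allow_dynamic_strides };
  return p;
}

/* STRIDE is only meaningful when STRIDE_IS_CONSTANT is true.  */
static bool
policy_allows_prefetch (struct prefetch_policy p, bool stride_is_constant,
                        long stride)
{
  if (!stride_is_constant)
    return p.allow_dynamic_strides;
  return labs (stride) >= p.min_stride;
}
```

In this model, coupled_policy (2048) rejects every non-constant stride,
whereas independent_policy (2048, true) still allows them while filtering
constant strides below 2048 bytes.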

I've seen big gains due to prefetching of non-constant strides, but it 
tends to be tricky to control and usually comes together with 
significant regressions as well.

The fact that we potentially unroll loops along with issuing prefetch 
hints also makes things a bit erratic.

>>
>> The default settings should guarantee no changes for existing targets. Those
>> are free to tweak the settings as necessary.
>>
>> No regressions in the testsuite and bootstrapped ok on aarch64-linux.
>>
>> Ok?
>>
>
> Are there any benchmark numbers you can share?
> I think this approach is sensible.
>

Comparing the previous, more aggressive, pass behavior with the new one 
I've seen a slight improvement for CPU2006, 0.15% for both INT and FP.

For CPU2017 the previous behavior was actually a bit harmful, regressing 
performance by about 1.2% in intspeed. The new behavior kept intspeed 
stable and slightly improved fpspeed by 0.15%.

The motivation for the future is to have better control of software 
prefetching so we can fine-tune the pass, either through generic loop 
prefetch code or by using the target-specific parameters.

> Since your patch touches generic code as well as AArch64
> code you'll need an approval from a midend maintainer as well as an AArch64 maintainer.
> Also, GCC development is now in the regression fixing stage, so unless this fixes a regression
> it may have to wait until GCC 9 development is opened.

That is my understanding. I thought I'd put this up for review anyway so 
people can chime in and provide their thoughts.

Thanks for the review.

Luis

Patch

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ef1b0bc..8736bd9 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -230,6 +230,9 @@  struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  /* The minimum constant stride beyond which we should use prefetch
+     hints for.  */
+  const int minimum_stride;
   const int default_opt_level;
 };
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 174310c..0ed9f14 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,6 +547,7 @@  static const cpu_prefetch_tune generic_prefetch_tune =
   -1,			/* l1_cache_size  */
   -1,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
+  -1,			/* minimum_stride */
   -1			/* default_opt_level  */
 };
 
@@ -556,6 +557,7 @@  static const cpu_prefetch_tune exynosm1_prefetch_tune =
   -1,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
+  -1,			/* minimum_stride */
   -1			/* default_opt_level  */
 };
 
@@ -565,7 +567,8 @@  static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   1024,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  2048,			/* minimum_stride */
+  3			/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune thunderxt88_prefetch_tune =
@@ -574,6 +577,7 @@  static const cpu_prefetch_tune thunderxt88_prefetch_tune =
   32,			/* l1_cache_size  */
   128,			/* l1_cache_line_size  */
   16*1024,		/* l2_cache_size  */
+  -1,			/* minimum_stride */
   3			/* default_opt_level  */
 };
 
@@ -583,6 +587,7 @@  static const cpu_prefetch_tune thunderx_prefetch_tune =
   32,			/* l1_cache_size  */
   128,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
+  -1,			/* minimum_stride */
   -1			/* default_opt_level  */
 };
 
@@ -592,6 +597,7 @@  static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
   32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   256,			/* l2_cache_size  */
+  -1,			/* minimum_stride */
   -1			/* default_opt_level  */
 };
 
@@ -10461,6 +10467,11 @@  aarch64_override_options_internal (struct gcc_options *opts)
 			   aarch64_tune_params.prefetch->l2_cache_size,
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch->minimum_stride >= 0)
+    maybe_set_param_value (PARAM_PREFETCH_MINIMUM_STRIDE,
+			   aarch64_tune_params.prefetch->minimum_stride,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
 
   /* Use the alternative scheduling-pressure algorithm by default.  */
   maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 27c5974..1cb1ef5 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10567,6 +10567,21 @@  The size of L1 cache, in kilobytes.
 @item l2-cache-size
 The size of L2 cache, in kilobytes.
 
+@item prefetch-minimum-stride
+Minimum constant stride, in bytes, to start using prefetch hints for.  If
+the stride is less than this threshold, prefetch hints will not be issued.
+
+This setting is useful for processors that have hardware prefetchers, in
+which case there may be conflicts between the hardware prefetchers and
+the software prefetchers.  If the hardware prefetchers have a maximum
+stride they can handle, it should be used here to improve the use of
+software prefetchers.
+
+A value of -1, the default, means we don't have a threshold and therefore
+prefetch hints can be issued for any constant stride.
+
+This setting is only useful for strides that are known and constant.
+
 @item loop-interchange-max-num-stmts
 The maximum number of stmts in a loop to be interchanged.
 
diff --git a/gcc/params.def b/gcc/params.def
index 930b318..bf2d12c 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -790,6 +790,15 @@  DEFPARAM (PARAM_L2_CACHE_SIZE,
 	  "The size of L2 cache.",
 	  512, 0, 0)
 
+/* The minimum constant stride beyond which we should use prefetch hints
+   for.  */
+
+DEFPARAM (PARAM_PREFETCH_MINIMUM_STRIDE,
+	  "prefetch-minimum-stride",
+	  "The minimum constant stride beyond which we should use prefetch "
+	  "hints for.",
+	  -1, 0, 0)
+
 /* Maximum number of statements in loop nest for loop interchange.  */
 
 DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
diff --git a/gcc/params.h b/gcc/params.h
index 98249d2..96012db 100644
--- a/gcc/params.h
+++ b/gcc/params.h
@@ -196,6 +196,8 @@  extern void init_param_values (int *params);
   PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE)
 #define L2_CACHE_SIZE \
   PARAM_VALUE (PARAM_L2_CACHE_SIZE)
+#define PREFETCH_MINIMUM_STRIDE \
+  PARAM_VALUE (PARAM_PREFETCH_MINIMUM_STRIDE)
 #define USE_CANONICAL_TYPES \
   PARAM_VALUE (PARAM_USE_CANONICAL_TYPES)
 #define IRA_MAX_LOOPS_NUM \
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 2f10db1..112ccac 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -992,6 +992,22 @@  prune_by_reuse (struct mem_ref_group *groups)
 static bool
 should_issue_prefetch_p (struct mem_ref *ref)
 {
+  /* Some processors may have a hardware prefetcher that may conflict with
+     prefetch hints for a range of strides.  Make sure we don't issue
+     prefetches for such cases if the stride is within this particular
+     range.  */
+  if (cst_and_fits_in_hwi (ref->group->step)
+      && abs_hwi (int_cst_value (ref->group->step)) < PREFETCH_MINIMUM_STRIDE)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Step for reference %u:%u (" HOST_WIDE_INT_PRINT_DEC
+		 ") is less than the minimum required stride of %d\n",
+		 ref->group->uid, ref->uid, int_cst_value (ref->group->step),
+		 PREFETCH_MINIMUM_STRIDE);
+      return false;
+    }
+
   /* For now do not issue prefetches for only first few of the
      iterations.  */
   if (ref->prefetch_before != PREFETCH_ALL)
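
For anyone who wants to try the option, here is a minimal sketch of two
loops that fall on either side of a 2048-byte threshold. The function and
type names are made up for this example. Whether a hint is actually issued
for the second loop still depends on the pass's other heuristics (cache
sizes, trip counts, and so on), so this only shows which accesses the new
threshold would filter out.

```c
#include <stddef.h>

#define RECORD_BYTES 4096

/* One record per 4096 bytes, so walking an array of these and touching
   only the key gives a constant stride of 4096 bytes.  */
struct record
{
  int key;
  char payload[RECORD_BYTES - sizeof (int)];
};

/* Constant stride of sizeof (int) bytes, far below a 2048-byte threshold.  */
long
sum_ints (const int *v, size_t n)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += v[i];
  return sum;
}

/* Constant stride of sizeof (struct record) bytes, above the threshold.  */
long
sum_keys (const struct record *r, size_t n)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += r[i].key;
  return sum;
}
```

With a compiler that includes this patch, and assuming the file is saved as
strides.c, something like

  gcc -O2 -fprefetch-loop-arrays --param prefetch-minimum-stride=2048 \
      -fdump-tree-aprefetch-details -c strides.c

should log the "less than the minimum required stride" message for the
first loop in the aprefetch dump file.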