PR78888 - add value range info for tolower/toupper

Message ID CAAgBjMmtn1_bTRbKv=LdaqVioaDTObjs+ubnS+QSvcqR6fecNg@mail.gmail.com
State New
Headers show

Commit Message

Prathamesh Kulkarni Aug. 3, 2017, 7:28 a.m.
Hi,
The attached patch adds value-range info for __builtin_tolower and
__builtin_toupper.
In the patch, I have just settled for anti-range ~['a', 'z'] for
return value of toupper.
Would that be correct albeit imprecise ?

A more precise range would be:
[0, UCHAR_MAX] intersect ~['a', 'z'] union EOF
as mentioned in PR.
I am not sure though if it's possible for a SSA_NAME to have multiple
disjoint ranges ?

Bootstrap+tested on x86_64-unknown-linux-gnu.

Thanks,
Prathamesh

Comments

Jakub Jelinek Aug. 3, 2017, 7:51 a.m. | #1
On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:
> --- a/gcc/tree-vrp.c

> +++ b/gcc/tree-vrp.c

> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)

>  		return;

>  	      }

>  	  break;

> +	case CFN_BUILT_IN_TOUPPER:

> +	case CFN_BUILT_IN_TOLOWER:

> +	  if (tree lhs = gimple_call_lhs (stmt))

> +	    {

> +	      tree type = TREE_TYPE (lhs);

> +	      unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';

> +	      unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';

> +	      tree range_min = build_int_cstu (type, min);

> +	      tree range_max = build_int_cstu (type, max);

> +	      set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);

> +	      return;


You are hardcoding here host characters and using it for target.
I think you need to use
lang_hooks.to_target_charset
(really no idea how it works or doesn't in LTO, but gimple-fold.c is already
using it among others).
Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
(once?) that the target charset has this property.

	Jakub
Prathamesh Kulkarni Aug. 3, 2017, 10:43 a.m. | #2
On 3 August 2017 at 13:21, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:

>> --- a/gcc/tree-vrp.c

>> +++ b/gcc/tree-vrp.c

>> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)

>>               return;

>>             }

>>         break;

>> +     case CFN_BUILT_IN_TOUPPER:

>> +     case CFN_BUILT_IN_TOLOWER:

>> +       if (tree lhs = gimple_call_lhs (stmt))

>> +         {

>> +           tree type = TREE_TYPE (lhs);

>> +           unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';

>> +           unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';

>> +           tree range_min = build_int_cstu (type, min);

>> +           tree range_max = build_int_cstu (type, max);

>> +           set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);

>> +           return;

>

> You are hardcoding here host characters and using it for target.

> I think you need to use

> lang_hooks.to_target_charset

> (really no idea how it works or doesn't in LTO, but gimple-fold.c is already

> using it among others).

> Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,

> which isn't the case for e.g. EBCDIC.  So I think you'd need to verify

> (once?) that the target charset has this property.

Hi Jakub,
Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.
Would following be a correct check that target has ascii charset and only then
set range to ~[a, z] ?

target_a = lang_hooks.to_target_charset ('a');
target_z = lang_hooks.to_target_charset('z');
if (target_a == 'a' && target_z == 'z')
  {
     // set vr to ~['a', 'z']
  }

Thanks,
Prathamesh
>

>         Jakub
Jakub Jelinek Aug. 3, 2017, 10:55 a.m. | #3
On Thu, Aug 03, 2017 at 04:13:38PM +0530, Prathamesh Kulkarni wrote:
> > You are hardcoding here host characters and using it for target.

> > I think you need to use

> > lang_hooks.to_target_charset

> > (really no idea how it works or doesn't in LTO, but gimple-fold.c is already

> > using it among others).

> > Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,

> > which isn't the case for e.g. EBCDIC.  So I think you'd need to verify

> > (once?) that the target charset has this property.

> Hi Jakub,

> Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.

> Would following be a correct check that target has ascii charset and only then

> set range to ~[a, z] ?

> 

> target_a = lang_hooks.to_target_charset ('a');

> target_z = lang_hooks.to_target_charset('z');

> if (target_a == 'a' && target_z == 'z')

>   {

>      // set vr to ~['a', 'z']

>   }


No.  E.g. if host == target charset is EBCDIC, then the condition is true,
but it isn't a consecutive range.
A rough check that would maybe work (at least would be false for EBCDIC
and true for ASCII) could be something like:
if (target_z > target_a && target_z - target_a == 25)
(which is not exact, some strange charset could have say '0'-'9' block in
the middle of 'a'-'z' block, and say 'h'-'r' outside of the range), but
perhaps good enough for real-world charsets.
In any case, you should probably investigate all the locales say in glibc or
some other big locale repository whether tolower/toupper have the expected
properties there.
Plus of course, the set vr to ~[ ] needs to use target_a/target_z.

	Jakub
Joseph Myers Aug. 3, 2017, 2:38 p.m. | #4
On Thu, 3 Aug 2017, Jakub Jelinek wrote:

> In any case, you should probably investigate all the locales say in glibc or

> some other big locale repository whether tolower/toupper have the expected

> properties there.


They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 
correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 
multibyte character and toupper can only return single-byte characters.

-- 
Joseph S. Myers
joseph@codesourcery.com
Jakub Jelinek Aug. 3, 2017, 2:50 p.m. | #5
On Thu, Aug 03, 2017 at 02:38:35PM +0000, Joseph Myers wrote:
> On Thu, 3 Aug 2017, Jakub Jelinek wrote:

> 

> > In any case, you should probably investigate all the locales say in glibc or

> > some other big locale repository whether tolower/toupper have the expected

> > properties there.

> 

> They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 

> correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 

> multibyte character and toupper can only return single-byte characters.


Indeed,
#include <ctype.h>
#include <locale.h>

int
main ()
{
  setlocale (LC_ALL, "");
  int i;
  for (i = -1000; i < 1000; i++)
    if (tolower (i) >= 'A' && tolower (i) <= 'Z')
      __builtin_abort ();
    else if (toupper (i) >= 'a' && toupper (i) <= 'z')
      __builtin_abort ();
  return 0;
}
fails for LC_ALL=tr_TR.UTF-8, because tolower ('I') is 'I'.
Not to mention that the result is unspecified if the functions are called
with a value outside of the range of unsigned char or EOF.
Therefore, this optimization is invalid.

	Jakub

Patch

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c b/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c
new file mode 100644
index 00000000000..2bcddf1f2c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-evrp-slim" } */
+
+void g (int x)
+{
+  if (__builtin_toupper ((unsigned char)x) == 'a')
+    __builtin_abort ();
+}
+
+void h (int x)
+{
+  if (__builtin_tolower ((unsigned char)x) == 'A')
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-not "__builtin_abort" "evrp" } } */
diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
index 79a29bf0efb..7137a4c52ec 100644
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -3778,6 +3778,19 @@  extract_range_basic (value_range *vr, gimple *stmt)
 		return;
 	      }
 	  break;
+	case CFN_BUILT_IN_TOUPPER:
+	case CFN_BUILT_IN_TOLOWER:
+	  if (tree lhs = gimple_call_lhs (stmt))
+	    {
+	      tree type = TREE_TYPE (lhs);
+	      unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
+	      unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
+	      tree range_min = build_int_cstu (type, min);
+	      tree range_max = build_int_cstu (type, max);
+	      set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
+	      return;
+	    }
+	  break;
 	default:
 	  break;
 	}