If I compile this program with -O2 on MIPS:

int foo(int *p, unsigned short c)
{
  signed short i;
  int x = 0;
  for (i = 0; i < c; i++) {
    x = x + *p;
    p++;
  }
  return x;
}

With GCC 4.7.* or earlier I get loop code that looks like:

$L3:
	lw	$5,0($4)
	addiu	$3,$3,1
	seh	$3,$3
	addu	$2,$2,$5
	bne	$3,$6,$L3
	addiu	$4,$4,4

With GCC 4.8 and later I get:

$L3:
	lw	$7,0($4)
	addiu	$3,$3,1
	seh	$3,$3
	slt	$6,$3,$5
	addu	$2,$2,$7
	bne	$6,$0,$L3
	addiu	$4,$4,4

This loop has one more instruction in it and is slower. A version of this bug appears in EEMBC 1.1. If I change the loop index to be unsigned then I get the better code, but I can't change the benchmark I am testing, so I am trying to figure out what changed in GCC and how to generate the faster code.
The IVOPTs dump looks different on x86_64; it looks like 4.7 was able to do a better job here. Can you provide a runnable testcase so we can see whether x86_64 regressed in speed as well?
Created attachment 31365 [details]
runnable performance test case

Here is a runnable test case. You may need to increase the loop counts depending on the speed of your machine. On the MIPS system I used, a 4.7.3 GCC took 36.127 seconds and a 4.8.1 GCC took 38.435 seconds (little endian, -O2, static linking).
Caused by:

------------------------------------------------------------------------
r193882 | rguenth | 2012-11-28 09:27:10 +0000 (Wed, 28 Nov 2012) | 19 lines

2012-11-28  Richard Biener  <rguenther@suse.de>

	PR c/35634
	* gimple.h (gimplify_self_mod_expr): Declare.
	* gimplify.c (gimplify_self_mod_expr): Export.  Take a different
	type for performing the arithmetic in.
	(gimplify_expr): Adjust.
	* tree-vect-loop-manip.c (vect_can_advance_ivs_p): Strip
	sign conversions we can re-apply after adjusting the IV.

	c-family/
	* c-gimplify.c (c_gimplify_expr): Gimplify self-modify expressions
	here and use a type with proper overflow behavior for types that
	would need to be promoted for the arithmetic.

	* gcc.dg/torture/pr35634.c: New testcase.
	* g++.dg/torture/pr35634.C: Likewise.
	* gcc.dg/vect/pr18536.c: Mark worker function noinline.
------------------------------------------------------------------------
(In reply to Maciej W. Rozycki from comment #3)
> Caused by:

Well that corrects how i++ is done.
(In reply to Andrew Pinski from comment #4)
> Well that corrects how i++ is done.

The old MIPS assembly code produced was AFAICT correct. The loop termination condition was expressed as:

	bne	$3,$6,$L3

which represented (i != c) rather than (i < c), but we start `i' from 0 and increment by one at a time, so both expressions are equivalent in this context.

Here I believe the following C language standard clause applies[1]:

"Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type."

so for both operands the expression is supposed to use the "unsigned short" type, which is 16-bit on the MIPS target. There are no 16-bit ALU operations defined in the MIPS architecture though, so at the assembly (and therefore machine) level both `c' and `i' were sign-extended to 32 bits:

	andi	$5,$5,0xffff
	seh	$6,$5

and:

	seh	$3,$3

respectively (of course ANDI is redundant here; there's no need to zero-extend before sign-extending, SEH does not require it), before the BNE comparison quoted above was made. That correctly mimicked the 16-bit operations required by the language here (of course zero-extension of both `c' and `i' would do as well).

Now, after the change, `c' is zero-extended only (no sign-extension afterwards):

	andi	$5,$5,0xffff

while `i' is still sign-extended:

	seh	$3,$3

Then the loop termination condition is expressed as:

	slt	$6,$3,$5
	bne	$6,$0,$L3

instead. Notice the SLT instruction, which accurately represents the (i < c) termination condition, however using *signed* arithmetic. That means that for `c' equal to e.g. 32768 the loop will never terminate. I believe this is not what the clause of the C language standard quoted above implies. For unsigned arithmetic SLTU would have to be used instead.
So it looks to me like the performance regression merely happens to be a visible sign of a bigger correctness problem. Have I missed anything?

[1] "Programming languages -- C", ISO/IEC 9899:1999(E), Section 6.3.1.8
    "Usual arithmetic conversions".
(In reply to Maciej W. Rozycki from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > Well that corrects how i++ is done.
>
> Old MIPS assembly code produced was AFAICT correct. The loop termination
> condition was expressed as:

It is now correctly represented in the gimple IR rather than only in the final assembly.
Btw, the change suggests a workaround - use -fwrapv.

And indeed, the loop does not terminate for c == 32768, but that's a "bug" in the testcase: signed short < 32768 is always true (both promote to int, and i++ wraps at 32767).

That i++ now correctly wraps 'defined' also means that IV analysis is pessimized, as it cannot assume that 'i' does not wrap nor can it assume that the loop terminates. But that's just the awkward way the testcase is written (without the C standard in mind ...).

You may also try -funsafe-loop-optimizations, which makes us optimistically assume IVs don't overflow and loops terminate.

"confirmed", but I think this one needs more analysis on _what_ transform we want to happen, and then an assessment of whether that is a valid transform at all.
Richard,

I wasn't aware integer promotions applied here, thanks for pointing it out. The new code is therefore correct while the old one was not. Unfortunately neither -fwrapv nor -funsafe-loop-optimizations changes anything.

  Maciej
(In reply to Maciej W. Rozycki from comment #8)
> Richard,
>
> I wasn't aware integer promotions applied here, thanks for pointing it
> out. New code is therefore correct while old one was not. Unfortunately
> neither -fwrapv nor -funsafe-loop-optimizations changes anything.

But then it must be a target-specific thing, because -fwrapv certainly changes it to the same IL as was emitted before that change (also with -fwrapv, of course).

So, any reason not to close this PR, given that while we generate slower code, the slower code is actually correct while the old code was wrong?
(In reply to Jakub Jelinek from comment #9)

Jakub,

The fix has corrected the evaluation of `i++', however it has regressed the evaluation of `i < c'. This is because in the loop `i' is only ever assigned values that are lower than or equal to the value of `c', and here the `int' type can represent all values representable with the `signed short' and `unsigned short' types.

Specifically, assuming the properties of the MIPS target, where `int' is 32 bits wide and `short' is 16 bits wide, we have the following cases of the `c' input value here:

1. 0 -- the loop is not entered because `i' is preset to 0 and equals
   `c' right away.

2. [1,32767] -- `i' is incremented from 0 until it reaches the value of
   `c', at which point the loop terminates.

3. [32768,65535] -- `i' is incremented from 0 up to 32767, at which
   point it wraps around to -32768 and continues being incremented up to
   32767 again. And so on, and so on. It never reaches the value of `c'
   or any higher value, and therefore the loop never terminates.

Based on the above observations, it is enough to check for `i == c' as the loop termination condition.

So I think this is still a performance regression from the user's point of view, even though I realise this may not necessarily be an optimisation GCC has been previously designed for. Therefore I'd prefer to keep the bug open until/unless we decide it's impractical to implement a code transformation that would restore previous performance.

  Maciej
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

--- res-O2.txt	2015-01-14 08:08:39.000000000 +0100
+++ res-O3.txt	2015-01-14 08:09:57.000000000 +0100
@@ -1,7 +1,7 @@
-000, 001, 002, 003, 016, 017, 018, 019, 032, 033, 034, 035, 048, 049, 050, 051,
-064, 065, 066, 067, 080, 081, 082, 083, 096, 097, 098, 099, 112, 113, 114, 115,
-128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179,
-192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243,
+000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015,
+016, 017, 018, 019, 020, 021, 022, 023, 024, 025, 026, 027, 028, 029, 030, 031,
+032, 033, 034, 035, 036, 037, 038, 039, 040, 041, 042, 043, 044, 045, 046, 047,
+048, 049, 050, 051, 052, 053, 054, 055, 056, 057, 058, 059, 060, 061, 062, 063,
 004, 005, 006, 007, 020, 021, 022, 023, 036, 037, 038, 039, 052, 053, 054, 055,
 068, 069, 070, 071, 084, 085, 086, 087, 100, 101, 102, 103, 116, 117, 118, 119,
 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,

The problem is only with the first 64 elements.
(In reply to Uroš Bizjak from comment #13)
> Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

Sorry, this comment doesn't belong in this PR.
I am not sure yet where and how to improve this automatically, but I have found an interesting hand optimization that could point to a way to fix this. If I change the original function:

int foo(int *p, unsigned short c)
{
  signed short i;
  int x = 0;
  for (i = 0; i < c; i++) {
    x = x + *p;
    p++;
  }
  return x;
}

to:

int foo(int *p, unsigned short c)
{
  signed short i;
  unsigned short new_i;
  int x = 0;
  if (c > 32767)
    for (i = 0; i < c; i++) {
      x = x + *p;
      p++;
    }
  else
    for (new_i = 0; new_i < c; new_i++) {
      x = x + *p;
      p++;
    }
  return x;
}

then GCC 5.0 generates an empty infinite loop for the first for loop and a compact 4-instruction loop (better even than 4.7) for the second for loop. I am not sure where or if we can do this optimization in GCC, but I am going to investigate some more.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
GCC 4.9 branch is being closed
(In reply to Steve Ellcey from comment #15)
> I am not sure yet where and how to improve this automatically but I have
> found an interesting hand optimization that could point to a way to fix
> this. If I change the original function:

Perhaps tree-ssa-loop-niter.c could figure out that the IV has this kind of behavior, and some loop optimization could then version the loop - one version predicted likely with a finite number of iterations and the other one infinite.

Honza?
Yes, versioning for niter analysis is something that came up elsewhere as well. Note we likely want something like the loop-vectorized builtin so we can record that both are versions of each other (so we can scrap one if both end up "the same"). Or maybe simply record this in the loop struct (copy-of-loop-N).
Assuming that changing the condition of the loop to != would fix the regression, could we use range info to determine this? For instance, right before the ivopts pass, we have:

  <bb 2> [15.00%]:
  # RANGE [0, 65535] NONZERO 65535
  _19 = (int) c_11(D);
  if (_19 > 0)
    goto <bb 3>; [85.00%]
  else
    goto <bb 7>; [15.00%]

  <bb 3> [12.75%]:

  <bb 4> [85.00%]:
  # p_20 = PHI <p_14(6), p_10(D)(3)>
  # i_21 = PHI <i_15(6), 0(3)>
  # x_22 = PHI <x_13(6), 0(3)>
  _1 = *p_20;
  x_13 = _1 + x_22;
  p_14 = p_20 + 4;
  # RANGE [0, 65535]
  i.1_2 = (unsigned short) i_21;
  # RANGE [0, 65535]
  _3 = i.1_2 + 1;
  i_15 = (short int) _3;
  # RANGE [-32768, 32767]
  _4 = (int) i_15;
  if (_4 < _19)
    goto <bb 6>; [85.00%]
  else
    goto <bb 5>; [15.00%]

The range for the max value `c' (_19) is [0..65535]. So if we know that the induction variable `i' (_4) starts at the lower end of that (0), and we increment the IV by 1, then we could always replace the < by a !=, right? This would avoid having to do loop versioning.

If this is acceptable, would we do this in the ivopts pass? If so, I guess we could start by explaining to the pass that _4 is an IV, because right now I see that we determine that everything but _4 is an IV: i_21, i.1_2, _3, i_15.

Thoughts/hints greatly appreciated :).
GCC 5 branch is being closed
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
At -O2, since a few optimizations are not enabled (e.g. some loop-based optimizations), the code is not optimized much. At -O3, GCC can now vectorize it, whereas with GCC 4.8 the code was not vectorized. I guess the pain in performance may be mitigated.
If the code is changed as below, so that `i' does not start from 0 and the compare code is `!=', then wrap/overflow on `i' may happen and optimizations (e.g. vectorization) are not applied. The patch below is trying to optimize this kind of loop.

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570424.html

int foo (int *p, unsigned short u_n, unsigned short i)
{
  int x = 0;
  for (; i != u_n; i++)
    {
      x = x + p[i];
    }
  return x;
}
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.