If I compile this program with -O2 on MIPS:

int foo(int *p, unsigned short c)
{
  signed short i;
  int x = 0;
  for (i = 0; i < c; i++) {
    x = x + *p;
    p++;
  }
  return x;
}

With GCC 4.7.* or earlier I get loop code that looks like:

$L3:
	lw	$5,0($4)
	addiu	$3,$3,1
	seh	$3,$3
	addu	$2,$2,$5
	bne	$3,$6,$L3
	addiu	$4,$4,4

With GCC 4.8 and later I get:

$L3:
	lw	$7,0($4)
	addiu	$3,$3,1
	seh	$3,$3
	slt	$6,$3,$5
	addu	$2,$2,$7
	bne	$6,$0,$L3
	addiu	$4,$4,4

This loop has one more instruction in it and is slower. A version of this bug appears in EEMBC 1.1. If I change the loop index to be unsigned then I get the better code, but I can't change the benchmark I am testing, so I am trying to figure out what changed in GCC and how to generate the faster code.
The IVOPTs dump looks different on x86_64; it looks like 4.7 was able to do a better job here. Can you provide a runnable testcase so we can see whether x86_64 regressed in speed as well?
Created attachment 31365 [details]
runnable performance test case

Here is a runnable test case. You may need to increase the loop counts depending on the speed of your machine. On the MIPS system I used, a 4.7.3 GCC took 36.127 seconds and a 4.8.1 GCC took 38.435 seconds (little endian, -O2, static linking).
Caused by:

------------------------------------------------------------------------
r193882 | rguenth | 2012-11-28 09:27:10 +0000 (Wed, 28 Nov 2012) | 19 lines

2012-11-28  Richard Biener  <rguenther@suse.de>

	PR c/35634
	* gimple.h (gimplify_self_mod_expr): Declare.
	* gimplify.c (gimplify_self_mod_expr): Export.  Take a different
	type for performing the arithmetic in.
	(gimplify_expr): Adjust.
	* tree-vect-loop-manip.c (vect_can_advance_ivs_p): Strip
	sign conversions we can re-apply after adjusting the IV.

	c-family/
	* c-gimplify.c (c_gimplify_expr): Gimplify self-modify expressions
	here and use a type with proper overflow behavior for types that
	would need to be promoted for the arithmetic.

	* gcc.dg/torture/pr35634.c: New testcase.
	* g++.dg/torture/pr35634.C: Likewise.
	* gcc.dg/vect/pr18536.c: Mark worker function noinline.
------------------------------------------------------------------------
(In reply to Maciej W. Rozycki from comment #3)
> Caused by:

Well that corrects how i++ is done.
(In reply to Andrew Pinski from comment #4)
> Well that corrects how i++ is done.

The old MIPS assembly code produced was AFAICT correct. The loop termination condition was expressed as:

	bne	$3,$6,$L3

which represented (i != c) rather than (i < c), but we start `i' from 0 and increment by one at a time, so both expressions are equivalent in this context.

Here I believe the following C language standard clause applies[1]:

"Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type."

so for both operands the expression is supposed to use the "unsigned short" type, which is 16-bit on the MIPS target. There are no 16-bit ALU operations defined in the MIPS architecture though, so at the assembly (and therefore machine) level both `c' and `i' were sign-extended to 32 bits:

	andi	$5,$5,0xffff
	seh	$6,$5

and:

	seh	$3,$3

respectively (of course ANDI is redundant here; there's no need to zero-extend before sign-extending, SEH does not require it), before the BNE comparison quoted above was made. That correctly mimicked the 16-bit operations required by the language here (of course zero-extension of both `c' and `i' would do as well).

Now, after the change, `c' is zero-extended only (no sign-extension afterwards):

	andi	$5,$5,0xffff

while `i' is still sign-extended:

	seh	$3,$3

Then the loop termination condition is expressed as:

	slt	$6,$3,$5
	bne	$6,$0,$L3

instead. Notice the SLT instruction, which accurately represents the (i < c) termination condition, however using *signed* arithmetic. That means that for `c' equal to e.g. 32768 the loop will never terminate. I believe this is not what the clause of the C language standard quoted above implies. For unsigned arithmetic SLTU would have to be used instead.
So it looks to me like the performance regression merely happens to be a visible sign of a bigger correctness problem. Have I missed anything?

[1] "Programming languages -- C", ISO/IEC 9899:1999(E), Section 6.3.1.8
    "Usual arithmetic conversions".
(In reply to Maciej W. Rozycki from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > Well that corrects how i++ is done.
>
> Old MIPS assembly code produced was AFAICT correct. The loop termination
> condition was expressed as:

It is now correctly represented in the gimple IR rather than only in the final assembly.
Btw, the change suggests a workaround - use -fwrapv.

And indeed, the loop does not terminate for c == 32768, but that's a "bug" in the testcase: signed short < 32768 is always true (both promote to int, and i++ wraps at 32767).

That i++ now correctly wraps 'defined' also means that IV analysis is pessimized, as it cannot assume that 'i' does not wrap nor can it assume that the loop terminates. But that's just the awkward way the testcase is written (without the C standard in mind ...).

You may also try -funsafe-loop-optimizations, which makes us optimistically assume IVs don't overflow and loops terminate.

"confirmed", but I think this one needs more analysis on _what_ transform we want to happen, and then an assessment of whether that is a valid transform at all.
Richard,

I wasn't aware integer promotions applied here, thanks for pointing it out. The new code is therefore correct while the old one was not. Unfortunately neither -fwrapv nor -funsafe-loop-optimizations changes anything.

  Maciej
(In reply to Maciej W. Rozycki from comment #8)
> Richard,
>
> I wasn't aware integer promotions applied here, thanks for pointing it
> out. New code is therefore correct while old one was not. Unfortunately
> neither -fwrapv nor -funsafe-loop-optimizations changes anything.

But then it must be a target-specific thing, because -fwrapv certainly changes it to the same IL as was emitted before that change (also with -fwrapv, of course).

So, any reason not to close this PR, given that while we generate slower code, the slower code is actually correct while the old code was wrong?
(In reply to Jakub Jelinek from comment #9)

Jakub,

The fix has corrected the evaluation of `i++', however it has regressed the evaluation of `i < c'. This is because in the loop `i' is only ever assigned values that are lower than or equal to the value of `c', and here the `int' type can represent all values representable with the `signed short' and `unsigned short' types.

Specifically, assuming the properties of the MIPS target, where `int' is 32 bits wide and `short' is 16 bits wide, we have the following cases of the `c' input value here:

1. 0 -- the loop is not entered because `i' is preset to 0 and equals
   `c' right away.

2. [1,32767] -- `i' is incremented from 0 until it reaches the value of
   `c', at which point the loop terminates.

3. [32768,65535] -- `i' is incremented from 0 up to 32767, at which
   point it wraps around to -32768 and continues being incremented up to
   32767 again. And so on, and so on. It never reaches the value of `c'
   or any higher value, and therefore the loop never terminates.

Based on the above observations, it is enough to check for `i == c' as the loop termination condition.

So I think this is still a performance regression from the user's point of view, even though I realise this may not necessarily be an optimisation GCC has been previously designed for. Therefore I'd prefer to keep the bug open until/unless we decide it's impractical to implement a code transformation that would restore previous performance.

  Maciej
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

--- res-O2.txt	2015-01-14 08:08:39.000000000 +0100
+++ res-O3.txt	2015-01-14 08:09:57.000000000 +0100
@@ -1,7 +1,7 @@
-000, 001, 002, 003, 016, 017, 018, 019, 032, 033, 034, 035, 048, 049, 050, 051,
-064, 065, 066, 067, 080, 081, 082, 083, 096, 097, 098, 099, 112, 113, 114, 115,
-128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179,
-192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243,
+000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015,
+016, 017, 018, 019, 020, 021, 022, 023, 024, 025, 026, 027, 028, 029, 030, 031,
+032, 033, 034, 035, 036, 037, 038, 039, 040, 041, 042, 043, 044, 045, 046, 047,
+048, 049, 050, 051, 052, 053, 054, 055, 056, 057, 058, 059, 060, 061, 062, 063,
 004, 005, 006, 007, 020, 021, 022, 023, 036, 037, 038, 039, 052, 053, 054, 055,
 068, 069, 070, 071, 084, 085, 086, 087, 100, 101, 102, 103, 116, 117, 118, 119,
 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,

The problem is only with the first 64 elements.
(In reply to Uroš Bizjak from comment #13)
> Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

Sorry, this comment doesn't belong in this PR.
I am not sure yet where and how to improve this automatically, but I have found an interesting hand optimization that could point to a way to fix this. If I change the original function:

int foo(int *p, unsigned short c)
{
  signed short i;
  int x = 0;
  for (i = 0; i < c; i++) {
    x = x + *p;
    p++;
  }
  return x;
}

to:

int foo(int *p, unsigned short c)
{
  signed short i;
  unsigned short new_i;
  int x = 0;
  if (c > 32767)
    for (i = 0; i < c; i++) {
      x = x + *p;
      p++;
    }
  else
    for (new_i = 0; new_i < c; new_i++) {
      x = x + *p;
      p++;
    }
  return x;
}

then GCC 5.0 generates an empty infinite loop for the first for loop and a compact 4-instruction loop (better even than 4.7) for the second for loop. I am not sure where or if we can do this optimization in GCC, but I am going to investigate some more.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
GCC 4.9 branch is being closed
(In reply to Steve Ellcey from comment #15)
> I am not sure yet where and how to improve this automatically but I have
> found an interesting hand optimization that could point to a way to fix
> this. If I change the original function:

Perhaps tree-ssa-loop-niter.c could figure out that the IV has this kind of behavior, and some loop optimization could then version the loop - one version predicted likely with a finite number of iterations and the other one infinite.

Honza?
Yes, versioning for niter analysis is something that came up elsewhere as well. Note we likely want something like the loop-vectorized builtin so we can record that both are versions of each other (so we can scrap one if both end up "the same"). Or maybe simply record this in the loop struct (copy-of-loop-N).
Assuming that changing the condition of the loop to != would fix the regression, could we use range info to determine this? For instance, right before the ivopts pass, we have:

  <bb 2> [15.00%]:
  # RANGE [0, 65535] NONZERO 65535
  _19 = (int) c_11(D);
  if (_19 > 0)
    goto <bb 3>; [85.00%]
  else
    goto <bb 7>; [15.00%]

  <bb 3> [12.75%]:

  <bb 4> [85.00%]:
  # p_20 = PHI <p_14(6), p_10(D)(3)>
  # i_21 = PHI <i_15(6), 0(3)>
  # x_22 = PHI <x_13(6), 0(3)>
  _1 = *p_20;
  x_13 = _1 + x_22;
  p_14 = p_20 + 4;
  # RANGE [0, 65535]
  i.1_2 = (unsigned short) i_21;
  # RANGE [0, 65535]
  _3 = i.1_2 + 1;
  i_15 = (short int) _3;
  # RANGE [-32768, 32767]
  _4 = (int) i_15;
  if (_4 < _19)
    goto <bb 6>; [85.00%]
  else
    goto <bb 5>; [15.00%]

The range for the max value `c' (_19) is [0..65535]. So if we know that the induction variable `i' (_4) starts at the lower end of that (0), and we increment the IV by 1, then we could always replace the < by a !=, right? This would avoid having to do loop versioning.

If this is acceptable, would we do this in the ivopts pass? If so, I guess we could start by explaining to the pass that _4 is an IV, because right now I see that we determine that everything but _4 is an IV: i_21, i.1_2, _3, i_15.

Thoughts/hints greatly appreciated :).
GCC 5 branch is being closed
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
At -O2, since a few optimizations are not enabled (e.g. some loop-based optimizations), the code is not optimized much. At -O3, GCC can now vectorize it, whereas with GCC 4.8 the code was not vectorized. I guess the pain in performance may be mitigated.
If the code is changed as below, so that `i' does not start from 0 and the compare code is `!=', then wrap/overflow on `i' may happen and optimizations (e.g. vectorization) are not applied. The patch below is trying to optimize this kind of loop.

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570424.html

int foo (int *p, unsigned short u_n, unsigned short i)
{
  int x = 0;
  for (; i != u_n; i++)
    {
      x = x + p[i];
    }
  return x;
}
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.