Created attachment 37920 [details]
current ira dump

After a quick investigation of the testcase in gcc/testsuite/gcc.target/arm/pr45701-1.c for cortex-m0 on trunk, I found that the test case fails due to a change in register allocation after revision r226901. Before that revision, register allocation would load the global 'hist_verify' into r6, representing 'old_verify', prior to the function call to pre_process_line; 'old_verify' is used after the call. With the patch it instead loads it into r3, a caller-saved register, which means it has to be spilled before the function call and reloaded after.

Before patch:

history_expand_line_internal:
	push	{r3, r4, r5, r6, r7, lr}
	ldr	r3, .L5
	ldr	r5, .L5+4
	ldr	r4, [r3]
	movs	r3, #0
	ldr	r6, [r5]	; <--- load of 'hist_verify' into r6
	movs	r7, r0
	str	r3, [r5]
	bl	pre_process_line
	adds	r6, r4, r6
	str	r6, [r5]
	movs	r4, r0
	cmp	r7, r0
	bne	.L2
	bl	str_len
	adds	r0, r0, #1
	bl	x_malloc
	movs	r1, r4
	bl	str_cpy
	movs	r4, r0
.L2:
	movs	r0, r4
	@ sp needed
	pop	{r3, r4, r5, r6, r7, pc}

Current:

history_expand_line_internal:
	push	{r0, r1, r2, r4, r5, r6, r7, lr}
	ldr	r3, .L3
	ldr	r5, .L3+4
	ldr	r6, [r3]
	ldr	r3, [r5]	; <--- load of 'hist_verify' into r3
	movs	r7, r0
	str	r3, [sp, #4]	; <--- Spill
	movs	r3, #0
	str	r3, [r5]
	bl	pre_process_line
	ldr	r3, [sp, #4]	; <--- Reload
	movs	r4, r0
	adds	r6, r6, r3
	str	r6, [r5]
	cmp	r7, r0
	bne	.L1
	bl	str_len
	adds	r0, r0, #1
	bl	x_malloc
	movs	r1, r4
	bl	str_cpy
	movs	r4, r0
.L1:
	movs	r0, r4
	@ sp needed
	pop	{r1, r2, r3, r4, r5, r6, r7, pc}

I have also attached the ira and reload dumps for both pre-patch and current compilers. In the current reload dump, insn 9 is the load into r3 and insn 62 the spill; in the pre-patch ira/reload dumps the load is insn 10. I am not familiar with RA in GCC, so I'm not entirely sure what code to blame for this sub-optimal allocation; any comments or pointers would be most welcome.
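For readers without the testcase at hand, here is a hypothetical C reduction of the pattern, reconstructed from the assembly above. The globals, helper names, and signatures are inferred from the disassembly; the real gcc.target/arm/pr45701-1.c may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in globals and helpers inferred from the assembly; the real
   testcase may declare them differently.  */
int a1 = 3;
int hist_verify = 5;

char *pre_process_line(char *line) { return line; }            /* stub */
size_t str_len(const char *s) { return strlen(s); }            /* stub */
char *x_malloc(size_t n) { return malloc(n); }                 /* stub */
char *str_cpy(char *d, const char *s) { return strcpy(d, s); } /* stub */

char *history_expand_line_internal(char *line)
{
    int old_verify = hist_verify;   /* value RA must keep live across the call */
    char *new_line;

    hist_verify = 0;
    new_line = pre_process_line(line);
    hist_verify = a1 + old_verify;  /* use of old_verify after the call */

    if (new_line == line)           /* the cmp r7, r0 / bne in the assembly */
        new_line = str_cpy(x_malloc(str_len(line) + 1), line);
    return new_line;
}
```

The allocation question is where `old_verify` lives across the `pre_process_line` call: a callee-saved register (r6, pre-patch) or a caller-saved one (r3) that must be spilled and reloaded.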
Created attachment 37921 [details] current reload dump
Created attachment 37922 [details] pre-patch ira dump
Created attachment 37923 [details] pre-patch reload dump
Revision r226901 is linked to PR64164, so I added Alexandre Oliva to the watch list.
Ah yes, I forgot to mention: this is reproducible with:

$ arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -S pr45701-1.c
So probably another coalescing "fallout"
AFAICT the coalescing code is working as expected here.

Working with r226900... So, the only real statement of interest is:

# iftmp.0_1 = PHI <iftmp.0_18(3), new_line_9(2)>

Which results in the following partition map:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (iftmp.0_18 - 18 )

And the following coalesce list:

Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 2] : Success -> 0

Note that new_line_9 isn't ever processed, which is a bit odd to say the least.

Moving to r226901 we have:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (new_line_9 - 9 )
Partition 3 (iftmp.0_18 - 18 )

Coalesce list: (1)iftmp.0_1 & (9)new_line_9 [map: 0, 2] : Success -> 0
Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 3] : Success -> 0

Note that we coalesced new_line_9 into the same partition as iftmp.0_{1,18}. That seems valid given their use in the PHI and my quick review of the conflicts.

So looking at the actual expansion, r226900 will emit a copy from new_line_9 into iftmp.0_{1,18}; that's a result of r226900 not coalescing the objects. So AFAICT this isn't a coalescing issue, at least not at the gimple->rtl expansion point.
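At the source level, a PHI like the one above typically arises from a conditional assignment where each arm supplies one PHI argument. A minimal sketch (hypothetical names and shape, not the actual testcase):

```c
/* Hypothetical shape producing iftmp.0_1 = PHI <iftmp.0_18(3), new_line_9(2)>:
   one PHI argument comes from the call in the taken arm, the other is the
   unmodified new_line flowing around it.  */
char *ident(char *s) { return s; }   /* stand-in for the duplicating call */

char *pick(char *line, char *new_line, char *(*dup)(char *))
{
    return (new_line == line) ? dup(line)   /* feeds one PHI argument */
                              : new_line;   /* feeds the other (new_line_9) */
}
```

With aggressive coalescing, all three SSA names sharing that PHI land in one partition, so no copy is emitted at expansion time.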
So it's an RA issue then which previously was mitigated by the extra copy (and thus split life-range).
I think that's a fair characterization. The extra copy emitted by the older compiler gives the allocator more freedom. With coalescing getting more aggressive, the copy is gone and the allocator's freedom is reduced. I'll try to have a look at what the allocator is doing, but I doubt it's realistically something that can be addressed in this release cycle.
(In reply to Jeffrey A. Law from comment #9)
> I think that's a fair characterization.  The extra copy emitted by the older
> compiler gives the allocator more freedom.  With coalescing getting more
> aggressive, the copy is gone and the allocator's freedom is reduced.
>
> I'll try to have a look at what the allocator is doing, but I doubt it's
> realistically something that can be addressed in this release cycle.

I agree. It will probably be hard to fix in IRA at this stage.

Coalescing is a controversial thing; that is why there are so many coalescing algorithms. I tried a lot of them when I worked on the global RA. Finally, I found that implicit coalescing worked best. The word `implicit' means that we propagate hard register preferences (through copies, including implicit ones for two-operand constraints) from already assigned pseudos to unassigned ones. When it is possible to assign the same hard register, we do it and remove the copies. Otherwise, we can still assign a hard register, which might be impossible after we explicitly coalesced two pseudos. Only LRA does explicit coalescing, for pseudos assigned to memory, as we have no constraints on the number of stack slots and memory-memory moves are expensive and require an additional hard reg.

I guess this sort of PR could probably be fixed if we had live-range splitting at any point, not only on loop borders. But it might create other PRs if it makes wrong decisions :) Unfortunately, it is all about heuristics. They can work successfully in one case and do bad things in another. The performance of credible benchmarks should be the criterion.
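The "implicit coalescing" idea can be illustrated with a toy model (this is not IRA's actual code or data structures): the hard-register preference flows through a copy, and the copy is removed only if both ends get the same register; when the preferred register conflicts, the destination still gets some register rather than inheriting the merged pseudo's fate.

```c
#define NPSEUDO 4
#define NO_REG  (-1)

/* Toy model, not IRA: conflict[p] is a bitmask of hard regs pseudo p
   cannot use, assigned[p] is its final hard reg (or NO_REG = spill).  */
int conflict[NPSEUDO];
int assigned[NPSEUDO];

/* Assign pseudo dst, which is copied from the already-assigned pseudo src,
   preferring src's hard register so the copy can be deleted.  */
int assign_with_pref(int dst, int src, int nhard)
{
    int want = assigned[src];
    if (want != NO_REG && !(conflict[dst] & (1 << want)))
        return assigned[dst] = want;     /* same reg: copy becomes removable */
    for (int h = 0; h < nhard; h++)      /* fall back to any non-conflicting reg */
        if (!(conflict[dst] & (1 << h)))
            return assigned[dst] = h;    /* copy stays, but no spill */
    return assigned[dst] = NO_REG;       /* spill */
}
```

Explicit coalescing would have merged dst and src up front; if the merged pseudo then conflicts with the preferred register, it can end up with no register at all, which is exactly the flexibility implicit coalescing preserves.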
So given the conflicts during IRA, I can't see a way for IRA to do a better job. Essentially the key allocno/pseudo wants hard reg 0 to avoid the spillage, but it also conflicts with hard reg 0.

Prior to CSE1 we have the following key statements:

(insn 15 14 16 2 (set (reg/v/f:SI 116 [ <retval> ])
        (reg:SI 0 r0)) j.c:23 748 {*thumb1_movsi_insn}
     (nil))
[ ... ]
(jump_insn 19 18 20 2 (set (pc)
        (if_then_else (ne (reg/v/f:SI 117 [ line ])
                (reg/v/f:SI 116 [ <retval> ]))
            (label_ref:SI 36)
            (pc))) j.c:25 756 {cbranchsi4_insn}
     (int_list:REG_BR_PROB 8987 (nil))
 -> 36)
[ ... ]
(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))
[ ... ]
(insn 31 30 36 3 (set (reg/v/f:SI 116 [ <retval> ])
        (reg:SI 0 r0)) j.c:25 748 {*thumb1_movsi_insn}
     (nil))
[ ... ]
(insn 28 27 29 3 (set (reg:SI 1 r1)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))
[ ... ]
(insn 37 39 38 4 (set (reg/i:SI 0 r0)
        (reg/v/f:SI 116 [ <retval> ])) j.c:26 748 {*thumb1_movsi_insn}
     (nil))
(insn 38 37 0 4 (use (reg/i:SI 0 r0)) j.c:26 -1
     (nil))

Of particular interest is that r116 is not live-in to bb3. If you do the full analysis, it can be shown that r116 does not conflict with r0 before cse1. And that's key, because to get the code we want, r116 needs to be assigned to r0.

cse (like DOM) has the ability to look at an equality conditional and propagate equivalences into the true or false arm. And if we look at insn 19, we've got an equality conditional between r117 and r116, which will set up an equivalence between r116 and r117 for bb3. So in bb3, cse1 will change the initial assignment from:

(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))

to:

(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 116 [ <retval> ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))

Which makes r116 live-in for bb3. But note that it doesn't change insn 28 (yet).
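The effect cse1 exploits can be shown with a hypothetical source-level reduction: after an equality test, the two values are interchangeable in the arm the equivalence dominates, and substituting one for the other extends its live range.

```c
/* In the arm where line == retval holds, a pass may rewrite uses of
   `line' (r117 above) to `retval' (r116), making r116 live into that
   block.  The rewrite is valid, but it is what creates the conflict
   with hard reg 0 described above.  */
int use_after_compare(int line, int retval)
{
    if (line != retval)
        return retval;
    return line + 1;   /* cse1 may effectively turn this into retval + 1 */
}
```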
forwprop then comes along and changes insn 28 to look like:

(insn 28 27 29 3 (set (reg:SI 1 r1)
        (reg/v/f:SI 116 [ <retval> ])) j.c:25 748 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v/f:SI 116 [ <retval> ])
        (nil)))

Now r116 is both live-in to bb3 and conflicts with r0 within bb3. At which point we have lost.

I've got a couple things I want to poke at... but nothing that I think has a high probability of success.
Slight correction. I was looking at the wrong part of the dump when I said cse1 didn't change insn 28. It is cse1 that changes insn 28. So this is strictly an issue with the transformations cse1 makes.
Essentially this is the same problem we have with DOM using context-sensitive equivalences to copy-propagate into subgraphs, but in CSE.

I'm increasingly of the opinion that the equivalences DOM finds should be used for simplification only, not for copy propagation, and that opinion would apply to CSE as well. I'm not sure if we can put the pieces in place for gcc-6, but I think that's the direction we ought to be going.

The alternative would be to do some kind of range splitting. What we'd want to know is: do we have a context-sensitive equivalence, and would splitting the range in the dominated subgraph result in a graph that is more easily/better colorable? In this case the subgraph creates all the conflicts, so it's an obvious split point, but I'm not sure how easily we could generalize that.

Either way, I don't think this should be a release-blocking issue. Moving to P2, but keeping the target gcc-6.
Some further notes. I was looking at what the impact would be if we just stopped recording the problematical equivalences in CSE, both to see if the equivalences are useful at all and, if they are, to get a sense of when (which might perhaps lead to some useful conditions for recording them). I was quite surprised at how much of a difference these equivalences make in the resulting code generation.

One of the things that is emerging is that these equivalences are useful when the copy propagation they enable allows one operand to die at the comparison when it didn't die before. That in turn may allow creation of a REG_EQUIV note on the insn that initializes the dying register, which then allows substitution of the equivalent memory for the register in the comparison. We still have the same number of memory references, but in that case we use one less register and have one less instruction.

We obviously don't have that level of insight during CSE, but given uses/sets DF info, we can get a lot of the way there. Anyway, I just wanted to record some of my findings. I'm putting this down for now, as it's not likely to be a gcc-6 kind of change.
GCC 6.1 has been released.
Any progress on this one?
GCC 6.2 is being released, adjusting target milestone.
GCC 6.3 is being released, adjusting target milestone.
GCC 6.4 is being released, adjusting target milestone.
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
With today's trunk, I still see the same problem:

history_expand_line_internal:
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r0, r1, r2, r4, r5, r6, r7, lr}
	ldr	r3, .L3
	ldr	r6, .L3+4
	ldr	r7, [r3]
	ldr	r3, [r6]	; <--- load of 'hist_verify' into r3
	movs	r5, r0
	str	r3, [sp, #4]	; <--- Spill
	movs	r3, #0
	str	r3, [r6]
	bl	pre_process_line
	ldr	r3, [sp, #4]	; <--- Reload
	movs	r4, r0
	adds	r7, r7, r3
	str	r7, [r6]
	cmp	r5, r0
	bne	.L1
	bl	str_len
	adds	r0, r0, #1
	bl	x_malloc
	movs	r1, r4
	bl	str_cpy
	movs	r4, r0
.L1:
	@ sp needed
	movs	r0, r4
	pop	{r1, r2, r3, r4, r5, r6, r7, pc}
.L4:
	.align	2
.L3:
	.word	a1
	.word	hist_verify
Sorry, I can not reproduce this. With today's trunk I get the following for pr45701-1.c (-Os -mthumb):

history_expand_line_internal:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r3, r4, r5, r6, r7, lr}
	mov	r4, r0
	ldr	r5, .L3
	ldr	r3, .L3+4
	ldr	r7, [r5]
	ldr	r6, [r3]
	movs	r3, #0
	str	r3, [r5]
	bl	pre_process_line
	cmp	r4, r0
	add	r6, r6, r7
	str	r6, [r5]
	bne	.L1
	bl	str_len
	adds	r0, r0, #1
	bl	x_malloc
	mov	r1, r4
	pop	{r3, r4, r5, r6, r7, lr}
	b	str_cpy
.L1:
	pop	{r3, r4, r5, r6, r7, pc}
For what CPU did you configure GCC? With today's trunk I still see the same code as in comment #24. I can get the same code as you have in comment #25 if I force -mcpu=cortex-a9. The bug report is about cortex-m0, so you either need to configure GCC --with-cpu=cortex-m0 or use -mcpu=cortex-m0
(In reply to Christophe Lyon from comment #26)
> For what CPU did you configure GCC?
> With today's trunk I still see the same code as in comment #24.
>
> I can get the same code as you have in comment #25 if I force
> -mcpu=cortex-a9.
>
> The bug report is about cortex-m0, so you either need to configure GCC
> --with-cpu=cortex-m0 or use -mcpu=cortex-m0

Ok. Thank you. I've reproduced this with -mcpu=cortex-m0.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.