
Bug 70164 - [11/12/13/14/15 Regression] Code/performance regression due to poor register allocation on Cortex-M0
Summary: [11/12/13/14/15 Regression] Code/performance regression due to poor register ...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 6.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2016-03-10 10:58 UTC by Andre Vieira
Modified: 2024-04-26 10:31 UTC
CC: 6 users

See Also:
Host:
Target: arm-none-eabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-03-10 00:00:00


Attachments
current ira dump (4.22 KB, text/plain), 2016-03-10 10:58 UTC, Andre Vieira
current reload dump (4.17 KB, text/plain), 2016-03-10 10:59 UTC, Andre Vieira
pre-patch ira dump (4.37 KB, text/plain), 2016-03-10 10:59 UTC, Andre Vieira
pre-patch reload dump (3.15 KB, text/plain), 2016-03-10 11:00 UTC, Andre Vieira

Description Andre Vieira 2016-03-10 10:58:53 UTC
Created attachment 37920 [details]
current ira dump

After a quick investigation of the testcase in gcc/testsuite/gcc.target/arm/pr45701-1.c for cortex-m0 on trunk, I found that the test case fails due to a change in register allocation introduced by revision r226901.

Before that revision, the register allocator chose to load the global 'hist_verify' into r6 (representing 'old_verify') prior to the call to pre_process_line; 'old_verify' is used again after the call. With the patch it instead loads the value into r3, a caller-saved register, which means it has to be spilled before the call and reloaded afterwards.
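
For reference, the testcase has roughly the following shape; this is a reconstruction from the generated code below, so the actual pr45701-1.c may differ in details, and the local 'saved_a1' is just an illustrative name:

extern int hist_verify;
extern int a1;
extern char *pre_process_line (char *);
extern char *str_cpy (char *, char *);
extern char *x_malloc (int);
extern int str_len (char *);

char *
history_expand_line_internal (char *line)
{
  char *new_line;
  int old_verify = hist_verify;   /* the value we want kept in a call-saved
                                     register across the call below */
  int saved_a1 = a1;

  hist_verify = 0;
  new_line = pre_process_line (line);
  hist_verify = saved_a1 + old_verify;

  if (new_line == line)
    return str_cpy (x_malloc (str_len (line) + 1), line);
  return new_line;
}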

Before patch:
history_expand_line_internal:
        push    {r3, r4, r5, r6, r7, lr}
        ldr     r3, .L5
        ldr     r5, .L5+4
        ldr     r4, [r3]
        movs    r3, #0
        ldr     r6, [r5]       ; <--- load of 'hist_verify' onto r6
        movs    r7, r0
        str     r3, [r5]
        bl      pre_process_line
        adds    r6, r4, r6
        str     r6, [r5]
        movs    r4, r0
        cmp     r7, r0
        bne     .L2
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        movs    r1, r4
        bl      str_cpy
        movs    r4, r0
.L2:
        movs    r0, r4
        @ sp needed
        pop     {r3, r4, r5, r6, r7, pc}

Current:
history_expand_line_internal:
        push    {r0, r1, r2, r4, r5, r6, r7, lr}
        ldr     r3, .L3
        ldr     r5, .L3+4
        ldr     r6, [r3]
        ldr     r3, [r5]        ; <--- load of 'hist_verify' onto r3
        movs    r7, r0
        str     r3, [sp, #4]    ; <--- Spill
        movs    r3, #0
        str     r3, [r5]
        bl      pre_process_line
        ldr     r3, [sp, #4]    ; <--- Reload
        movs    r4, r0
        adds    r6, r6, r3
        str     r6, [r5]
        cmp     r7, r0
        bne     .L1
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        movs    r1, r4
        bl      str_cpy
        movs    r4, r0
.L1:
        movs    r0, r4
        @ sp needed
        pop     {r1, r2, r3, r4, r5, r6, r7, pc}


I have also attached the ira and reload dumps for both the pre-patch and current compilers. In the current reload dump, insn 9 is the load into r3 and insn 62 is the spill. In the pre-patch ira/reload dumps the load is insn 10.

I am not familiar with RA in GCC, so I'm not entirely sure what code to blame for this sub-optimal allocation; any comments or pointers would be most welcome.
Comment 1 Andre Vieira 2016-03-10 10:59:26 UTC
Created attachment 37921 [details]
current reload dump
Comment 2 Andre Vieira 2016-03-10 10:59:42 UTC
Created attachment 37922 [details]
pre-patch ira dump
Comment 3 Andre Vieira 2016-03-10 11:00:00 UTC
Created attachment 37923 [details]
pre-patch reload dump
Comment 4 Andre Vieira 2016-03-10 11:02:31 UTC
Revision r226901 is linked to PR64164, so I added Alexandre Oliva to the watch list.
Comment 5 Andre Vieira 2016-03-10 11:05:19 UTC
Ah yes, I forgot to mention: this is reproducible with:
$ arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -S pr45701-1.c
Comment 6 Richard Biener 2016-03-10 11:16:17 UTC
So probably another coalescing "fallout"
Comment 7 Jeffrey A. Law 2016-03-10 21:00:27 UTC
AFAICT the coalescing code is working as expected here.  Working with r226900...

So, the only real statement of interest is:

  # iftmp.0_1 = PHI <iftmp.0_18(3), new_line_9(2)>


Which results in the following partition map:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (iftmp.0_18 - 18 )

And the following coalesce list:

Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 2] : Success -> 0

Note that new_line_9 isn't ever processed, which is a bit odd to say the least.  Moving to r226901 we have:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (new_line_9 - 9 )
Partition 3 (iftmp.0_18 - 18 )

Coalesce list: (1)iftmp.0_1 & (9)new_line_9 [map: 0, 2] : Success -> 0
Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 3] : Success -> 0

Note that we coalesced new_line_9 into the same partition as iftmp.0_{1,18}.

That seems valid given their use in the PHI and my quick review of the conflicts.  

So looking at the actual expansion r226900 will emit a copy from new_line_9 into iftmp.0_{1,18}.  That's a result of r226900 not coalescing the objects.

So AFAICT this isn't a coalescing issue, at least not at the gimple->rtl expansion point.
Comment 8 Richard Biener 2016-03-23 10:56:36 UTC
So it's an RA issue then, which previously was mitigated by the extra copy (and thus the split live range).
Comment 9 Jeffrey A. Law 2016-03-23 15:44:37 UTC
I think that's a fair characterization.  The extra copy emitted by the older compiler gives the allocator more freedom.   With coalescing getting more aggressive, the copy is gone and the allocator's freedom is reduced.

I'll try to have a look at what the allocator is doing, but I doubt it's realistically something that can be addressed in this release cycle.
Comment 10 Vladimir Makarov 2016-03-23 17:15:33 UTC
(In reply to Jeffrey A. Law from comment #9)
> I think that's a fair characterization.  The extra copy emitted by the older
> compiler gives the allocator more freedom.   With coalescing getting more
> aggressive, the copy is gone and the allocator's freedom is reduced.
> 
> I'll try to have a look at what the allocator is doing, but I doubt it's
> realistically something that can be addressed in this release cycle.

I agree.  It will probably be hard to fix in IRA at this stage.

Coalescing is a controversial thing, which is why there are so many coalescing algorithms.  I tried a lot of them when I worked on the global RA, and in the end I found that implicit coalescing worked best.  The word `implicit` means that we propagate hard-register preferences (through copies, including the implicit ones for two-operand constraints) from already-assigned pseudos to unassigned ones.  When it is possible to assign the same hard register, we do so and remove the copies.  Otherwise, we can still assign some hard register, which might be impossible if we had explicitly coalesced the two pseudos.

Only LRA does explicit coalescing, and only for pseudos assigned to memory, since there is no constraint on the number of stack slots and memory-memory moves are expensive and require an additional hard register.

This sort of PR could probably be fixed if we had live-range splitting at arbitrary points, not only at loop borders.  But that might create other PRs if it makes wrong decisions :)  Unfortunately, it is all about heuristics: they can work well in one case and do bad things in another.  The performance of credible benchmarks should be the criterion.
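
A minimal sketch of that implicit-coalescing idea, purely as an illustration; this is not IRA's actual code, and all types and names below are made up:

#include <stdbool.h>
#include <stddef.h>

#define NO_HARD_REG (-1)

struct pseudo
{
  int hard_reg;             /* assigned hard register, or NO_HARD_REG */
  int preferred_hard_reg;   /* propagated preference, or NO_HARD_REG  */
};

struct copy
{
  struct pseudo *src, *dst; /* an explicit copy, or the implicit one
                               created by a two-operand constraint    */
};

/* Stub for the conflict information the allocator would consult:
   does pseudo P conflict with hard register HREG?  */
extern bool conflicts_p (struct pseudo *p, int hreg);

/* Propagate hard-register preferences through copies: if one side of a
   copy already has a hard register, the other side prefers it.  */
void
propagate_preferences (struct copy *copies, size_t n_copies)
{
  for (size_t i = 0; i < n_copies; i++)
    {
      struct pseudo *src = copies[i].src, *dst = copies[i].dst;

      if (src->hard_reg != NO_HARD_REG && dst->hard_reg == NO_HARD_REG)
        dst->preferred_hard_reg = src->hard_reg;
      else if (dst->hard_reg != NO_HARD_REG && src->hard_reg == NO_HARD_REG)
        src->preferred_hard_reg = dst->hard_reg;
    }
}

/* When assigning P, honour the propagated preference if it does not
   conflict; otherwise fall back to any free hard register.  A copy is
   removed later only if both of its sides really did end up in the same
   hard register; nothing is merged up front.  */
int
choose_hard_reg (struct pseudo *p, int n_hard_regs)
{
  if (p->preferred_hard_reg != NO_HARD_REG
      && !conflicts_p (p, p->preferred_hard_reg))
    return p->preferred_hard_reg;

  for (int hreg = 0; hreg < n_hard_regs; hreg++)
    if (!conflicts_p (p, hreg))
      return hreg;

  return NO_HARD_REG;   /* no register available: spill */
}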
Comment 11 Jeffrey A. Law 2016-03-23 17:16:15 UTC
So given the conflicts during IRA I can't see a way for IRA to do a better job.  Essentially the key allocno/pseudo wants hard reg 0 to avoid the spillage, but it also conflicts with hard reg 0.

Prior to CSE1 we have the following key statements:

(insn 15 14 16 2 (set (reg/v/f:SI 116 [ <retval> ])
        (reg:SI 0 r0)) j.c:23 748 {*thumb1_movsi_insn}
     (nil))

[ ... ]
(jump_insn 19 18 20 2 (set (pc)
        (if_then_else (ne (reg/v/f:SI 117 [ line ])
                (reg/v/f:SI 116 [ <retval> ]))
            (label_ref:SI 36)
            (pc))) j.c:25 756 {cbranchsi4_insn}
     (int_list:REG_BR_PROB 8987 (nil))
 -> 36)

[ ... ]

(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))

[ ... ]
(insn 31 30 36 3 (set (reg/v/f:SI 116 [ <retval> ])
        (reg:SI 0 r0)) j.c:25 748 {*thumb1_movsi_insn}
     (nil))
[ ... ]

(insn 28 27 29 3 (set (reg:SI 1 r1)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))

[ ... ]
(insn 37 39 38 4 (set (reg/i:SI 0 r0)
        (reg/v/f:SI 116 [ <retval> ])) j.c:26 748 {*thumb1_movsi_insn}
     (nil))
(insn 38 37 0 4 (use (reg/i:SI 0 r0)) j.c:26 -1
     (nil))


Of particular interest is that r116 is not live-in to bb3.  If you do the full analysis, it can be shown that r116 does not conflict with r0 before cse1.  And that's key, because to get the code we want, r116 needs to be assigned to r0.

cse (like DOM) has the ability to look at an equality conditional and propagate equivalences into the true or false arm.  And if we look at insn 19, we've got an equality conditional between r117 and r116, which will set up an equivalence between r116 and r117 for bb3.

So in bb3, cse1 will change the initial assignment from:

(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))

to:

(insn 21 20 22 3 (set (reg:SI 0 r0)
        (reg/v/f:SI 116 [ <retval> ])) j.c:25 748 {*thumb1_movsi_insn}
     (nil))


Which makes r116 live-in to bb3.  But note that it doesn't change insn 28 (yet).

forwprop then comes along and changes insn 28 to look like:
(insn 28 27 29 3 (set (reg:SI 1 r1)
        (reg/v/f:SI 116 [ <retval> ])) j.c:25 748 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v/f:SI 116 [ <retval> ])
        (nil)))

Now r116 is both live-in to bb3 and conflicts with r0 within bb3.

At which point we have lost.
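
At the source level the effect is roughly the following; this is only an illustration using names borrowed from the testcase, not what GCC literally rewrites:

extern char *str_cpy (char *, char *);
extern char *x_malloc (int);
extern int str_len (char *);

/* Before cse1: in the arm taken when line == new_line, only 'line' is
   used, so the pseudo holding 'new_line' (r116) need not be live across
   the calls.  */
char *
equal_arm_before (char *line, char *new_line)
{
  if (line != new_line)
    return new_line;
  return str_cpy (x_malloc (str_len (line) + 1), line);
}

/* After cse1 propagates the equivalence line == new_line into that arm,
   the uses of 'line' become uses of 'new_line'.  This extends r116's
   live range across the calls, where it conflicts with the
   argument/return register r0.  */
char *
equal_arm_after (char *line, char *new_line)
{
  if (line != new_line)
    return new_line;
  return str_cpy (x_malloc (str_len (new_line) + 1), new_line);
}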

I've got a couple things I want to poke at...  But nothing that I think has a high probability of success.
Comment 12 Jeffrey A. Law 2016-03-23 17:21:23 UTC
Slight correction.  I was looking at the wrong part of the dump when I said cse1 didn't change insn 28.  It is cse1 that changes insn 28.  So this is strictly an issue with the transformations cse1 makes.
Comment 13 Jeffrey A. Law 2016-03-23 19:02:45 UTC
Essentially this is the same problem we have with DOM using context-sensitive equivalences to copy-propagate into subgraphs, but in CSE.  I'm increasingly of the opinion that the equivalences DOM finds should be used for simplification only, not for copy propagation.  That opinion applies to CSE as well.

I'm not sure if we can put the pieces in place for gcc-6, but I think that's the direction we ought to be going.

The alternative would be to do some kind of range splitting.  What we'd want to know is: do we have a context-sensitive equivalence, and would splitting the range in the dominated subgraph result in a graph that is more easily/better colorable?  In this case the subgraph creates all the conflicts, so it's an obvious split point, but I'm not sure how easily we could generalize that.
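
A tiny example of the simplification-vs-copy-propagation distinction (illustrative only; 'use' is a hypothetical external function):

extern void use (int);

void
example (int x, int y)
{
  if (x == y)
    {
      /* Simplification: the equivalence x == y lets this test be folded
         away, with no effect on any live range.  */
      if (x != y)
        use (0);

      /* Copy propagation: rewriting use (y) as use (x) extends x's live
         range into this arm and may create new conflicts, which is what
         hurts the allocator in this PR.  */
      use (y);
    }
}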

Either way I don't think this should be a release-blocking issue.  Moving to P2, but keeping the target gcc-6.
Comment 14 Jeffrey A. Law 2016-03-24 16:10:41 UTC
Some further notes.

I was looking at what the impact would be if we just stopped recording the problematic equivalences in CSE, both to see whether the equivalences are useful at all and, if they are, to get a sense of when (which might lead to some useful conditions for recording them).  I was quite surprised at how much of a difference these equivalences make in the resulting code generation.

One of the things that is emerging is that these equivalences are useful when the copy propagation they enable allows one operand to die at the comparison when it didn't die before.  That in turn may allow creation of a REG_EQUIV note on the insn that initializes the dying register, which then allows substitution of the equivalent memory for the register in the comparison.

We still have the same number of memory references, but we use one less register in that case and have one less instruction.

We obviously don't have that level of insight during CSE.  But given uses/sets DF info, we can get a lot of the way there.

Anyway, just wanted to record some of my findings.  I'm putting this down for now as it's not likely to be a gcc-6 kind of change.
Comment 15 Jakub Jelinek 2016-04-27 10:56:12 UTC
GCC 6.1 has been released.
Comment 16 Andre Vieira 2016-07-01 12:58:48 UTC
Any progress on this one?
Comment 17 Richard Biener 2016-08-22 08:26:57 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 18 Richard Biener 2016-08-22 08:28:02 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 19 Jakub Jelinek 2016-12-21 10:55:37 UTC
GCC 6.3 is being released, adjusting target milestone.
Comment 20 Richard Biener 2017-07-04 08:44:18 UTC
GCC 6.4 is being released, adjusting target milestone.
Comment 21 Jakub Jelinek 2018-10-26 10:07:45 UTC
GCC 6 branch is being closed
Comment 22 Richard Biener 2019-11-14 07:54:21 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 23 Jakub Jelinek 2020-03-04 09:35:51 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 24 Christophe Lyon 2020-04-10 10:57:40 UTC
With today's trunk, I still see the same problem:

history_expand_line_internal:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r0, r1, r2, r4, r5, r6, r7, lr}
        ldr     r3, .L3
        ldr     r6, .L3+4
        ldr     r7, [r3]
        ldr     r3, [r6]        ; <--- load of 'hist_verify' onto r3
        movs    r5, r0
        str     r3, [sp, #4]    ; <--- Spill
        movs    r3, #0
        str     r3, [r6]
        bl      pre_process_line
        ldr     r3, [sp, #4]    ; <--- Reload
        movs    r4, r0
        adds    r7, r7, r3
        str     r7, [r6]
        cmp     r5, r0
        bne     .L1
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        movs    r1, r4
        bl      str_cpy
        movs    r4, r0
.L1:
        @ sp needed
        movs    r0, r4
        pop     {r1, r2, r3, r4, r5, r6, r7, pc}
.L4:
        .align  2
.L3:
        .word   a1
        .word   hist_verify
Comment 25 Vladimir Makarov 2020-04-21 15:20:18 UTC
Sorry, I cannot reproduce this.  With today's trunk I have the following for pr45701-1.c (-Os -mthumb):

history_expand_line_internal:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r3, r4, r5, r6, r7, lr}
        mov     r4, r0
        ldr     r5, .L3
        ldr     r3, .L3+4
        ldr     r7, [r5]
        ldr     r6, [r3]
        movs    r3, #0
        str     r3, [r5]
        bl      pre_process_line
        cmp     r4, r0
        add     r6, r6, r7
        str     r6, [r5]
        bne     .L1
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        mov     r1, r4
        pop     {r3, r4, r5, r6, r7, lr}
        b       str_cpy
.L1:
        pop     {r3, r4, r5, r6, r7, pc}
Comment 26 Christophe Lyon 2020-04-21 16:14:15 UTC
For what CPU did you configure GCC?
With today's trunk I still see the same code as in comment #24.

I can get the same code as you have in comment #25 if I force -mcpu=cortex-a9.

The bug report is about cortex-m0, so you either need to configure GCC with --with-cpu=cortex-m0 or use -mcpu=cortex-m0.
Comment 27 Vladimir Makarov 2020-04-21 19:03:14 UTC
(In reply to Christophe Lyon from comment #26)
> For what CPU did you configure GCC?
> With today's trunk I still see the same code as in comment #24.
> 
> I can get the same code as you have in comment #25 if I force
> -mcpu=cortex-a9.
> 
> The bug report is about cortex-m0, so you either need to configure GCC
> --with-cpu=cortex-m0 or use -mcpu=cortex-m0

Ok.  Thank you.  I've reproduced this with -mcpu=cortex-m0.
Comment 28 Jakub Jelinek 2021-05-14 09:48:05 UTC
GCC 8 branch is being closed.
Comment 29 Richard Biener 2021-06-01 08:07:36 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 30 Richard Biener 2022-05-27 09:36:13 UTC
GCC 9 branch is being closed
Comment 31 Jakub Jelinek 2022-06-28 10:32:08 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 32 Richard Biener 2023-07-07 10:31:13 UTC
GCC 10 branch is being closed.