114513 – [11/12/13/14/15 Regression] [aarch64] floating-point registers are used when GPRs are preferred

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	14.0

Importance:	P2 normal
Target Milestone:	11.5
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:	114741
Blocks:

Reported:	2024-03-28 09:10 UTC by Di Zhao
Modified:	2024-04-26 10:57 UTC
CC List:	2 users

See Also:
Host:
Target:	aarch64
Build:
Known to work:	8.5.0
Known to fail:	9.3.0
Last reconfirmed:	2024-03-28 00:00:00

Description Di Zhao 2024-03-28 09:10:40 UTC

For the case below:

  typedef unsigned long int uint64_t;
  extern uint64_t rand_long ();
  
  double phi ()
  {
    double phi;
    register uint64_t a, b;
    const uint64_t mask = 1ULL << 63;
    int i;
  
    /* Pick any two starting points */
    a = rand_long ();
    b = rand_long ();
  
    /* Iterate until we approach overflow */
    for (i = 0; (i < 64) && !((a | b) & mask); i++)
      {
        register uint64_t c = a + b;
  
        a = b;
        b = c;
      }
  
    phi = (double) b / (double) a;
    return phi;
  }

On aarch64, GCC used floating-point registers for the loop:
    subs    w1, w1, #0x1
    fmov    d15, d31
    fmov    d31, x2
    b.eq    48 <phi+0x48>  // b.none

However, keeping "a" and "b" in GENERAL_REGS is much faster, for example:
    mov   x19, x0     
    mov   x0, x3      
    subs  w2, w2, #0x1
    b.eq  48

The options I used are -Ofast or -O3, with -mtune=generic, neoverse-n1, neoverse-n2 or ampere1.
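
To link and run the test case, rand_long needs a definition. A minimal sketch, assuming a hypothetical stub that always returns 1 (the real rand_long is not shown in this report) and using <stdint.h> in place of the hand-written typedef:

```c
#include <stdint.h>

/* Hypothetical stand-in for the external rand_long(); the real one is
   not part of this report.  Always returning 1 makes the run repeatable. */
uint64_t rand_long(void)
{
    return 1;
}

/* The test case from above, unchanged apart from (void) prototypes. */
double phi(void)
{
    double phi;
    register uint64_t a, b;
    const uint64_t mask = 1ULL << 63;
    int i;

    /* Pick any two starting points */
    a = rand_long();
    b = rand_long();

    /* Iterate until we approach overflow */
    for (i = 0; (i < 64) && !((a | b) & mask); i++) {
        register uint64_t c = a + b;

        a = b;
        b = c;
    }

    phi = (double) b / (double) a;
    return phi;
}
```

With both starting points equal to 1, the loop runs all 64 iterations and the result is close to the golden ratio. For inspecting the register allocation it is probably better to keep the stub in a separate file, so that GCC cannot see through rand_long when compiling phi.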

Comment 1 Di Zhao 2024-03-28 09:17:41 UTC

I've debugged this a bit. Judging from the dump file 311r.sched1, the problem seems to be related to floatunsdidf2: after this instruction is processed, the cost of FP_REGS for r107 is 0, while GENERAL_REGS costs 138. (r=107 is "b")


    Processing insn 31 {floatunsdidf2} (freq=69)
   31: r111:DF=uns_float(r107:DI)
      REG_DEAD r107:DI
      Alt 0:  (0) =w  (1) w
        op 0(r=111) new costs MEM:276 POINTER_AND_FP_REGS:345 FP_REGS:0 FP_LO_REGS:0 FP_LO8_REGS:0 GENERAL_REGS:345 STUB_REGS:345 TAILCALL_ADDR_REGS:345 W12_W15_REGS:345 W8_W11_REGS:345
        op 1(r=107) new costs MEM:276 POINTER_AND_FP_REGS:345 FP_REGS:0 FP_LO_REGS:0 FP_LO8_REGS:0 GENERAL_REGS:345 STUB_REGS:345 TAILCALL_ADDR_REGS:345 W12_W15_REGS:345 W8_W11_REGS:345
      Alt 1:  (0) w  (1) ?r
        op 1(r=107) new costs GENERAL_REGS:138 STUB_REGS:138 TAILCALL_ADDR_REGS:138 W12_W15_REGS:138 W8_W11_REGS:138
    Final costs after insn 31 {floatunsdidf2} (freq=69)
   31: r111:DF=uns_float(r107:DI)
      REG_DEAD r107:DI
        op 0(r=111) MEM:276(+276) POINTER_AND_FP_REGS:345(+345) FP_REGS:0(+0) FP_LO_REGS:0(+0) FP_LO8_REGS:0(+0) GENERAL_REGS:345(+345) STUB_REGS:345(+345) TAILCALL_ADDR_REGS:345(+345) W12_W15_REGS:345(+345) W8_W11_REGS:345(+345)
        op 1(r=107) MEM:9861(+276) POINTER_AND_FP_REGS:17631(+345) FP_REGS:0(+0) FP_LO_REGS:0(+0) FP_LO8_REGS:0(+0) GENERAL_REGS:138(+138) STUB_REGS:138(+138) TAILCALL_ADDR_REGS:138(+138) W12_W15_REGS:138(+138) W8_W11_REGS:138(+138)
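
For anyone reading the dump: the "(+N)" increments in the final costs for r107 match the per-class minimum over the two alternatives. A rough model in C, assuming that is how the alternatives are combined (a simplification, not the actual ira-costs code; the enum, NOT_LISTED and the helper names are made up for illustration):

```c
#include <limits.h>

/* Hypothetical subset of the register classes that appear in the dump. */
enum cost_class { CLS_MEM, CLS_FP_REGS, CLS_GENERAL_REGS, CLS_COUNT };

/* Assumption: a class that the dump does not list for an alternative
   did not get cheaper there, so model it as "no improvement". */
#define NOT_LISTED INT_MAX

/* Costs for op 1 (r=107), copied from the dump above. */
const int alt0_cost[CLS_COUNT] = { 276, 0, 345 };                  /* (1) w  */
const int alt1_cost[CLS_COUNT] = { NOT_LISTED, NOT_LISTED, 138 };  /* (1) ?r */

/* Per-class cost of the insn: the cheaper of the two alternatives. */
int merged_cost(enum cost_class c)
{
    return alt0_cost[c] < alt1_cost[c] ? alt0_cost[c] : alt1_cost[c];
}

/* Class with the lowest merged cost (first one wins on a tie). */
enum cost_class cheapest_class(void)
{
    enum cost_class best = CLS_MEM;
    for (int c = CLS_MEM + 1; c < CLS_COUNT; c++)
        if (merged_cost((enum cost_class) c) < merged_cost(best))
            best = (enum cost_class) c;
    return best;
}
```

In this model the insn adds 276 for MEM, 0 for FP_REGS and 138 for GENERAL_REGS, which matches the increments in the dump, and FP_REGS comes out cheapest.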

Comment 2 Andrew Pinski 2024-03-28 21:55:36 UTC

So we have:
```
;; Equal width integer to fp conversion.
(define_insn "<optab><fcvt_target><GPF:mode>2"
  [(set (match_operand:GPF 0 "register_operand")
        (FLOATUORS:GPF (match_operand:<FCVT_TARGET> 1 "register_operand")))]
  "TARGET_FLOAT"
  {@ [ cons: =0 , 1  ; attrs: type             , arch  ]
     [ w        , w  ; neon_int_to_fp_<Vetype> , simd  ] <su_optab>cvtf\t%<GPF:s>0, %<s>1
     [ w        , ?r ; f_cvti2f                , fp    ] <su_optab>cvtf\t%<GPF:s>0, %<w1>1
  }
)

```

Notice the ? on the r alternative.

Reading https://gcc.gnu.org/onlinedocs/gccint/Multi-Alternative.html, maybe this should be ^ instead of ?.

^ is new as of GCC 5.
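
Going by that documentation page, the difference between the two modifiers can be sketched as below. This is only a model of the documented wording ("one unit more costly"), not GCC code. disparage_units and its parameters are made-up names, and the actual weights used by the register allocator are not shown here.

```c
#include <stdbool.h>

/* Hypothetical helper: extra cost units that one constraint modifier adds
   to its alternative, following the wording of the GCC internals manual. */
int disparage_units(char modifier, bool operand_needs_reload)
{
    switch (modifier) {
    case '?':
        /* Always disparages the alternative slightly. */
        return 1;
    case '^':
        /* Like '?', but only if the operand carrying it needs a reload. */
        return operand_needs_reload ? 1 : 0;
    default:
        /* Other modifiers are outside this sketch. */
        return 0;
    }
}
```

So under this reading, the r alternative would only be penalized with ^ when operand 1 actually needs a reload, whereas ? penalizes it unconditionally.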

Comment 3 Andrew Pinski 2024-03-28 22:01:29 UTC

It looks like r9-332-g43d0a8ee88460a added the ? there and caused the regression.

I do think it should be ^, since the reload (spill) case should use w, while normally it tries both w and r. But I could be wrong.