For the case below:

typedef unsigned long int uint64_t;
extern uint64_t rand_long ();

double phi ()
{
  double phi;
  register uint64_t a, b;
  const uint64_t mask = 1ULL << 63;
  int i;

  /* Pick any two starting points */
  a = rand_long ();
  b = rand_long ();

  /* Iterate until we approach overflow */
  for (i = 0; (i < 64) && !((a | b) & mask); i++)
    {
      register uint64_t c = a + b;

      a = b;
      b = c;
    }

  phi = (double) b / (double) a;
  return phi;
}

On aarch64, GCC used floating-point registers for the loop:

        subs    w1, w1, #0x1
        fmov    d15, d31
        fmov    d31, x2
        b.eq    48 <phi+0x48>  // b.none

However, keeping "a" and "b" in GENERAL_REGS is much faster, like:

        mov     x19, x0
        mov     x0, x3
        subs    w2, w2, #0x1
        b.eq    48

The option I used is -Ofast/-O3, with -mtune=generic/neoverse-n1/neoverse-n2/ampere1.
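To try the test case standalone, something has to provide rand_long. Below is a minimal sketch, not part of the original report: the xorshift generator, its seed, and the 32-bit masking are all assumptions chosen only so that the loop runs for many iterations before bit 63 is set. phi is repeated with (void) prototypes so the file compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the report's external rand_long ().
   The seed is arbitrary; any nonzero value works for xorshift64.  */
static uint64_t state = 0x9E3779B97F4A7C15ULL;

uint64_t rand_long (void)
{
  assert (state != 0);          /* xorshift64 must never see a zero state */
  state ^= state << 13;
  state ^= state >> 7;
  state ^= state << 17;
  /* Assumption: keep only the low 32 bits (forced nonzero) so that the
     starting points are far from the 1ULL << 63 overflow mask.  */
  return (state & 0xFFFFFFFFULL) | 1ULL;
}

/* Same loop as in the report.  */
double phi (void)
{
  double phi;
  register uint64_t a, b;
  const uint64_t mask = 1ULL << 63;
  int i;

  /* Pick any two starting points */
  a = rand_long ();
  b = rand_long ();

  /* Iterate until we approach overflow */
  for (i = 0; (i < 64) && !((a | b) & mask); i++)
    {
      register uint64_t c = a + b;

      a = b;
      b = c;
    }

  phi = (double) b / (double) a;
  return phi;
}
```

Compiling this with -O3 -S and one of the -mtune values above should let you inspect the register choice in the loop. Keeping rand_long in the same file may change inlining, so moving it to a separate file is closer to the original setup.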
I've debugged this a bit. From dump file 311r.sched1, the problem seems to have something to do with floatunsdidf2. After processing this instruction, the costs of FP_REGS became 0. (r=107 is "b")

    Processing insn 31 {floatunsdidf2} (freq=69)
   31: r111:DF=uns_float(r107:DI)
      REG_DEAD r107:DI
      Alt 0: (0) =w (1) w
        op 0(r=111) new costs MEM:276 POINTER_AND_FP_REGS:345 FP_REGS:0 FP_LO_REGS:0 FP_LO8_REGS:0 GENERAL_REGS:345 STUB_REGS:345 TAILCALL_ADDR_REGS:345 W12_W15_REGS:345 W8_W11_REGS:345
        op 1(r=107) new costs MEM:276 POINTER_AND_FP_REGS:345 FP_REGS:0 FP_LO_REGS:0 FP_LO8_REGS:0 GENERAL_REGS:345 STUB_REGS:345 TAILCALL_ADDR_REGS:345 W12_W15_REGS:345 W8_W11_REGS:345
      Alt 1: (0) w (1) ?r
        op 1(r=107) new costs GENERAL_REGS:138 STUB_REGS:138 TAILCALL_ADDR_REGS:138 W12_W15_REGS:138 W8_W11_REGS:138
    Final costs after insn 31 {floatunsdidf2} (freq=69)
   31: r111:DF=uns_float(r107:DI)
      REG_DEAD r107:DI
        op 0(r=111) MEM:276(+276) POINTER_AND_FP_REGS:345(+345) FP_REGS:0(+0) FP_LO_REGS:0(+0) FP_LO8_REGS:0(+0) GENERAL_REGS:345(+345) STUB_REGS:345(+345) TAILCALL_ADDR_REGS:345(+345) W12_W15_REGS:345(+345) W8_W11_REGS:345(+345)
        op 1(r=107) MEM:9861(+276) POINTER_AND_FP_REGS:17631(+345) FP_REGS:0(+0) FP_LO_REGS:0(+0) FP_LO8_REGS:0(+0) GENERAL_REGS:138(+138) STUB_REGS:138(+138) TAILCALL_ADDR_REGS:138(+138) W12_W15_REGS:138(+138) W8_W11_REGS:138(+138)
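The per-insn increments shown in parentheses for op 1 (r=107) are consistent with taking, for each register class, the cheapest cost over the two alternatives. The sketch below is only a toy model of that reading, not GCC's actual cost code: the numbers are copied from the dump, only four classes are kept, and a class that Alt 1 does not print is assumed to bring no improvement. The enum and function names are made up for illustration.

```c
#include <stdio.h>

/* A subset of the register classes printed in the dump.  */
enum reg_class_id { MEM, POINTER_AND_FP_REGS, FP_REGS, GENERAL_REGS, N_CLASSES };

/* Marker for classes that Alt 1 does not list under "new costs".
   Assumption: an unlisted class did not get any cheaper.  */
#define NOT_PRINTED (-1)

static const char *const class_name[N_CLASSES] = {
  "MEM", "POINTER_AND_FP_REGS", "FP_REGS", "GENERAL_REGS"
};

/* op 1 (r=107) costs from the dump for insn 31.  */
static const int alt0_cost[N_CLASSES] = { 276, 345, 0, 345 };   /* (1) w  */
static const int alt1_cost[N_CLASSES] = { NOT_PRINTED, NOT_PRINTED,
                                          NOT_PRINTED, 138 };   /* (1) ?r */

/* Cheapest cost for one class over both alternatives.  */
int combined_cost (int cls)
{
  int best = alt0_cost[cls];

  if (alt1_cost[cls] != NOT_PRINTED && alt1_cost[cls] < best)
    best = alt1_cost[cls];
  return best;
}

/* Print the per-class increments, to be compared with the (+N) values
   in the "Final costs" part of the dump.  */
void print_combined_costs (void)
{
  int cls;

  for (cls = 0; cls < N_CLASSES; cls++)
    printf ("%s:+%d\n", class_name[cls], combined_cost (cls));
}
```

With these inputs, FP_REGS is charged nothing for this insn while GENERAL_REGS is charged 138. Note that 138 is 2 * 69, i.e. twice the insn frequency, which would fit a small fixed penalty coming from the ? on the r alternative, though that is my interpretation of the dump.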
So we have:
```
;; Equal width integer to fp conversion.
(define_insn "<optab><fcvt_target><GPF:mode>2"
  [(set (match_operand:GPF 0 "register_operand")
        (FLOATUORS:GPF (match_operand:<FCVT_TARGET> 1 "register_operand")))]
  "TARGET_FLOAT"
  {@ [ cons: =0 , 1  ; attrs: type              , arch ]
     [ w        , w  ; neon_int_to_fp_<Vetype> , simd ] <su_optab>cvtf\t%<GPF:s>0, %<s>1
     [ w        , ?r ; f_cvti2f                , fp   ] <su_optab>cvtf\t%<GPF:s>0, %<w1>1
  }
)
```

Notice the ? in there for r.

Reading https://gcc.gnu.org/onlinedocs/gccint/Multi-Alternative.html, maybe this should be ^ instead of ?.
^ is new as of GCC 5.
Looks like r9-332-g43d0a8ee88460a added the ? there and caused the regression.

I do think it should be ^, since the reload (spill) case should use w, while normally it should try both w and r. But I could be wrong.