aarch64 backend generates conditional jumps beyond 2MB of code (fixup value out of range)
Summary
The aarch64 backend currently generates conditional jumps to labels whose relative addresses do not fit in the 21-bit offset limit for conditional jumps. #23451 is an instance of this, and the reproducer posted on that ticket demonstrates that we are particularly at risk of doing this when emitting unoptimized code that, e.g., contains unfoldings.
Solutions
Somehow selectively generate far jumps
As the error in #23451 indicates, the `b.hi` conditional jump instruction is where the fixup value is out of range. If we instead generated a conditional jump over an unconditional far jump, we could avoid the issue. For example, a jump of the form:
```asm
cmp xi, xj
b.hi Lbl1
... code a ...
Lbl1:
... code b ...
```
could be equivalently (but less efficiently) represented as:
```asm
cmp xi, xj
b.ls Lbl2
adrp xk, Lbl1@PAGE
add xk, xk, Lbl1@PAGEOFF
br xk
Lbl2:
... code a ...
Lbl1:
... code b ...
```
If we could somehow track whether `... code a ...` is over 2MB of code, we could selectively generate far jumps like the latter. Alternatively, we could introduce a flag which causes the aarch64 backend to always generate jumps like the latter, just to give some sure way out of the fixup error.
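To make the selective approach concrete, here is a minimal Haskell sketch of the two-pass idea: compute label offsets, then rewrite any conditional jump whose target is out of range as an inverted short branch over an unconditional far jump. Everything here is hypothetical and heavily simplified — the types (`Instr`, `Cond`), the derived skip-label naming, and the `condReach` constant are assumptions, not GHC's actual aarch64 NCG representation, and a real implementation would have to iterate to a fixed point because expansion itself grows the code.

```haskell
import qualified Data.Map as M

type Label = String

-- Only the two conditions from the example; the real backend covers all of them.
data Cond = HI | LS deriving (Eq, Show)

-- Invert a condition so a short branch can skip over the far jump.
invert :: Cond -> Cond
invert HI = LS
invert LS = HI

-- Hypothetical, simplified instruction type.
data Instr
  = CondJump Cond Label   -- b.<cond> <label>     (4 bytes, limited reach)
  | FarJump Label         -- adrp/add/br sequence (12 bytes, +-4GiB reach)
  | Defn Label            -- label definition     (0 bytes)
  | Padding Int           -- stand-in for n bytes of unrelated code
  deriving Show

size :: Instr -> Int
size CondJump{}  = 4
size FarJump{}   = 12
size Defn{}      = 0
size (Padding n) = n

-- Assumed reach of a conditional branch (signed 21-bit byte offset).
condReach :: Int
condReach = 2 ^ (20 :: Int)

-- Pass 1: byte offset of every label.
labelOffsets :: [Instr] -> M.Map Label Int
labelOffsets is =
  M.fromList [ (l, o) | (Defn l, o) <- zip is (scanl (+) 0 (map size is)) ]

-- Pass 2: rewrite out-of-range conditional jumps as an inverted short
-- branch over an unconditional far jump.
expand :: [Instr] -> [Instr]
expand is = concat (zipWith go is (scanl (+) 0 (map size is)))
  where
    offs = labelOffsets is
    go (CondJump c l) here
      | Just there <- M.lookup l offs
      , abs (there - here) >= condReach
      = [ CondJump (invert c) (l ++ "_skip")  -- e.g. b.ls Lbl2
        , FarJump l                           -- adrp/add/br to Lbl1
        , Defn (l ++ "_skip") ]               -- Lbl2:
    go i _ = [i]
```

Running `expand` over a program with ~3MB between a `b.hi` and its label rewrites that one jump and leaves in-range jumps untouched.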
Generate smaller assembly
Looking at the dumped asm for that reproducer, there are approximately 2105508 bytes of instructions between the jump and the destination (just beyond the 2^21 = 2097152 byte limit). A lot of those instructions look like this:
```asm
...
mov x5, #54888
movk x5, #65535, lsl #16
movk x5, #65535, lsl #32
movk x5, #65535, lsl #48
add x5, x21, x5
str x6, [x5]
mov x6, #54896
movk x6, #65535, lsl #16
movk x6, #65535, lsl #32
movk x6, #65535, lsl #48
add x6, x21, x6
str x8, [x6]
mov x8, #54904
movk x8, #65535, lsl #16
movk x8, #65535, lsl #32
movk x8, #65535, lsl #48
add x8, x21, x8
...
```
These are just inefficient subtractions. For example, the first 5 instructions there could instead be just two instructions:
```asm
mov x5, #10648
sub x5, x21, x5
```
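As a quick sanity check on that equivalence (standalone Haskell arithmetic, not GHC code), the four-instruction `mov`/`movk` sequence materializes the constant `0xFFFFFFFFFFFFD668`, which is -10648 as a signed 64-bit value, so the subsequent `add x5, x21, x5` really computes `x21 - 10648` under two's-complement wraparound:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Int (Int64)
import Data.Word (Word64)

-- Value held in x5 after:
--   mov  x5, #54888
--   movk x5, #65535, lsl #16
--   movk x5, #65535, lsl #32
--   movk x5, #65535, lsl #48
materialized :: Word64
materialized =
      54888
  .|. (65535 `shiftL` 16)
  .|. (65535 `shiftL` 32)
  .|. (65535 `shiftL` 48)

-- Reinterpreted as a signed 64-bit value, this is -10648, so adding it
-- to x21 is the same as subtracting 10648.
asSigned :: Int64
asSigned = fromIntegral materialized
```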
There are about 17550 groups of five instructions matching that pattern between the branch and the destination in the output assembly. That means those groups account for 17550 groups * 5 instructions/group * 4 bytes/instruction = 351000 bytes. If all of those groups could be optimized to two instructions, they would instead use 17550 groups * 2 instructions/group * 4 bytes/instruction = 140400 bytes, saving about 351000 - 140400 = 210600 bytes and bringing the jump destination back within reach for this specific minimal example. This is obviously not a general fix for the issue, just something that would make it harder to bump into. I believe @AndreasK plans on addressing this specific aspect of the issue in his work on #23556.
Environment
- GHC version used: should be able to reproduce on HEAD; I am specifically using 27a472d4a34e3c0d287c6079335abeb4df41530a
Optional:
- Operating System: Darwin
- System Architecture: aarch64