aarch64 backend generates conditional jumps beyond 2MB of code (fixup value out of range)
The aarch64 backend currently generates conditional jumps to labels with relative addresses that do not fit in the 21 bit limit for conditional jumps. #23451 is an instance of this, and the reproducer posted on that ticket demonstrates that we are particularly at a risk of doing this when emitting unoptimized code that, e.g., contains unfoldings.
Somehow selectively generate far jumps
As the error in #23451 indicates, the
b.hi conditional jump instruction is where the fixup value is out of range. If we instead generated a conditional jump to a non-conditional jump, we could avoid the issue. For example, a jump of the form:
cmp xi, xj b.hi Lbl1 ... code a ... Lbl1: ... code b ...
Could be equivalently (but less efficiently) represented as:
cmp xi, xj b.ls Lbl2 adrp xk, Lbl1@PAGE add xk, xk, Lbl1@PAGEOFF br xk Lbl2: ... code a ... Lbl1: ... code b ...
If we could somehow track whether
code a was over 2MB of code, we could selectively generate far jumps like the latter. Alternatively, we could introduce a flag which causes the aarch64 backend to always generate jumps like the latter, just to give some sure way out of the fixup error.
Generate smaller assembly
Looking at the dumped asm for that reproducer, there is approximately 2105508 bytes of instructions between the jump and the destination (which is just beyond the 2^21=2097152 byte limit). A lot of those instructions look like this:
... mov x5, #54888 movk x5, #65535, lsl #16 movk x5, #65535, lsl #32 movk x5, #65535, lsl #48 add x5, x21, x5 str x6, [ x5 ] mov x6, #54896 movk x6, #65535, lsl #16 movk x6, #65535, lsl #32 movk x6, #65535, lsl #48 add x6, x21, x6 str x8, [ x6 ] mov x8, #54904 movk x8, #65535, lsl #16 movk x8, #65535, lsl #32 movk x8, #65535, lsl #48 add x8, x21, x8 ...
Which is just inefficient subtractions. For example, the first 5 instructions there could instead be just two instructions:
mov x5, #10648 sub x5, x21, x5
There are about 17550 groups of five instructions matching that pattern between the branch and the destination in the output assembly. That means that
17550 groups * 5 instructions/group * 4 bytes/instruction = 351000 bytes is used by those groups of instructions. If all of those groups could be optimized to two instructions, they would instead use
17550 groups * 2 instructions/group * 4 bytes/instruction = 140400 bytes, saving about
351000 - 140400 = 210600 and bringing the jump destination back within reach for this specific minimal example. This is obviously not a general fix for the issue, just something that could make this harder to bump into. I believe @AndreasK plans on addressing this specific aspect of the issue in his work on #23556.
- GHC version used: Should be able to reproduce on
HEAD, I am specifically using
- Operating System: Darwin
- System Architecture: aarch64