aarch64 backend generates conditional jumps beyond 2MB of code (fixup value out of range)

Summary

The aarch64 backend currently generates conditional jumps to labels with relative addresses that do not fit in the 21 bit limit for conditional jumps. #23451 is an instance of this, and the reproducer posted on that ticket demonstrates that we are particularly at a risk of doing this when emitting unoptimized code that, e.g., contains unfoldings.

Solutions

Somehow selectively generate far jumps

As the error in #23451 indicates, the b.hi conditional jump instruction is where the fixup value is out of range. If we instead generated a conditional jump to a non-conditional jump, we could avoid the issue. For example, a jump of the form:

    cmp xi, xj
    b.hi Lbl1
    ... code a ...
Lbl1:
    ... code b ...

Could be equivalently (but less efficiently) represented as:

    cmp xi, xj
    b.ls Lbl2
    adrp xk, Lbl1@PAGE
    add xk, xk, Lbl1@PAGEOFF
    br xk
Lbl2:
    ... code a ...
Lbl1:
    ... code b ...

If we could somehow track whether code a was over 2MB of code, we could selectively generate far jumps like the latter. Alternatively, we could introduce a flag which causes the aarch64 backend to always generate jumps like the latter, just to give some sure way out of the fixup error.

Generate smaller assembly

Looking at the dumped asm for that reproducer, there is approximately 2105508 bytes of instructions between the jump and the destination (which is just beyond the 2^21=2097152 byte limit). A lot of those instructions look like this:

...
	mov x5, #54888
	movk x5, #65535, lsl #16
	movk x5, #65535, lsl #32
	movk x5, #65535, lsl #48
	add x5, x21, x5
	str x6, [ x5 ]
	mov x6, #54896
	movk x6, #65535, lsl #16
	movk x6, #65535, lsl #32
	movk x6, #65535, lsl #48
	add x6, x21, x6
	str x8, [ x6 ]
	mov x8, #54904
	movk x8, #65535, lsl #16
	movk x8, #65535, lsl #32
	movk x8, #65535, lsl #48 
	add x8, x21, x8
...

Which is just inefficient subtractions. For example, the first 5 instructions there could instead be just two instructions:

mov x5, #10648 
sub x5, x21, x5

There are about 17550 groups of five instructions matching that pattern between the branch and the destination in the output assembly. That means that 17550 groups * 5 instructions/group * 4 bytes/instruction = 351000 bytes is used by those groups of instructions. If all of those groups could be optimized to two instructions, they would instead use 17550 groups * 2 instructions/group * 4 bytes/instruction = 140400 bytes, saving about 351000 - 140400 = 210600 and bringing the jump destination back within reach for this specific minimal example. This is obviously not a general fix for the issue, just something that could make this harder to bump into. I believe @AndreasK plans on addressing this specific aspect of the issue in his work on #23556.

Environment

GHC version used: Should be able to reproduce on HEAD, I am specifically using 27a472d4a34e3c0d287c6079335abeb4df41530a

Optional:

Operating System: Darwin
System Architecture: aarch64

Edited Jul 27, 2023 by Finley McIlwaine

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information