Aarch64: More efficient address computations for known offsets.

After !10998 (closed) lands, we will generally compute offset addresses in one of two ways:

If the offset fits into the store/load directly as an immediate, we encode it there.
If not, we load the full address into a register like any regular value and then use that register for the store.
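
The first case comes down to whether the offset is encodable in the load/store instruction itself. Here is a minimal sketch of that check in C, assuming the usual AArch64 immediate forms: an unsigned 12-bit immediate scaled by the access size for LDR/STR, and a signed 9-bit unscaled immediate for LDUR/STUR. The function names are hypothetical and do not correspond to anything in GHC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `offset` (in bytes) fits the unsigned, scaled 12-bit immediate
   of LDR/STR (unsigned offset form) for an access of `size` bytes.
   Only integer access sizes of 1, 2, 4 and 8 bytes are considered here. */
bool fits_scaled_imm12(int64_t offset, int64_t size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    return offset >= 0 && offset % size == 0 && offset / size <= 4095;
}

/* True if `offset` (in bytes) fits the signed, unscaled 9-bit immediate
   of LDUR/STUR. */
bool fits_unscaled_imm9(int64_t offset)
{
    return offset >= -256 && offset <= 255;
}
```

For a 64-bit store the largest directly encodable offset is 4095 * 8 = 32760 bytes, so an offset such as 65544 falls into the second case.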

However, in some edge cases we can do even better.

Since the store/load allows its own immediate, we can leave part of the address computation to the store, potentially reducing the overall number of instructions needed.

One example would be offsets just out of range for the store immediate.

Here are C/Cmm examples of such offsets:

#include <stdint.h>

void write(int32_t* a, int b) {
    a[16386] = 1;
}

#include "Cmm.h"

high_bits(I64 x) {
 A:
  I64[x+65544] = 0;
  return();
}

GCC gets away with just two instructions, which is ideal:

        add     x0, x0, 49152
        str     w1, [x0, 16368]

Currently GHC, even with my latest not-yet-merged changes, would produce far worse code:

	mov w17, #8
	movk w17, #1, lsl #16
	add x17, x22, x17
	str x18, [ x17 ]

To improve things, getAmode should check whether there is any benefit in adding an immediate to the store, even if that immediate is not enough to encode the whole offset.
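
One way such a check could look is sketched below in C. It is only an illustration of the arithmetic, not the actual getAmode logic. It assumes a non-negative offset that is aligned to the access size, the scaled 12-bit load/store immediate, and an ADD immediate of 12 bits optionally shifted left by 12. The type `offset_split` and the function `split_offset` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result: base' = base + add_imm, then access [base', mem_imm]. */
typedef struct {
    bool     ok;       /* false if no ADD + load/store split was found   */
    uint64_t add_imm;  /* ADD immediate, a multiple of 4096 (0 = no ADD) */
    uint64_t mem_imm;  /* byte offset left to the load/store immediate   */
} offset_split;

offset_split split_offset(uint64_t offset, uint64_t size)
{
    offset_split r = { false, 0, 0 };
    assert(size == 1 || size == 2 || size == 4 || size == 8);

    /* Largest byte offset the scaled 12-bit immediate can express. */
    uint64_t max_mem = 4095 * size;

    /* Unaligned offsets are not handled by this sketch. */
    if (offset % size != 0)
        return r;

    /* The offset fits the load/store on its own: no ADD needed. */
    if (offset <= max_mem) {
        r.ok = true;
        r.mem_imm = offset;
        return r;
    }

    /* Put as many 4096-byte pages as the shifted ADD immediate allows
       into the ADD, and leave the remainder to the load/store. */
    uint64_t pages = offset >> 12;
    if (pages > 4095)
        pages = 4095;
    uint64_t hi = pages << 12;
    uint64_t lo = offset - hi;   /* still a multiple of size */

    if (lo <= max_mem) {
        r.ok = true;
        r.add_imm = hi;
        r.mem_imm = lo;
    }
    return r;
}
```

For the Cmm example above, `split_offset(65544, 8)` yields an ADD of 65536 (16 shifted left by 12) and a store immediate of 8, which would take two instructions instead of four. The split is not unique: GCC's output above pushes more of the offset into the store immediate, and both choices are equally cheap.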
