Stores/loads need to be annotated with `!tbaa` and one of the above four types. For example:

```wiki
%ln1NH1 = load i64* %Sp_Arg, align 8, !tbaa !2
```
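The metadata nodes referenced by such annotations (`!2` above) define the type hierarchy at the end of the module. As a rough illustration only (the node numbers and names here are hypothetical, in LLVM 3.x metadata syntax, and not the exact tree GHC emits):

```wiki
!0 = metadata !{ metadata !"top" }                 ; root: everything may alias
!1 = metadata !{ metadata !"heap",  metadata !0 }
!2 = metadata !{ metadata !"stack", metadata !0 }  ; Sp loads/stores
!3 = metadata !{ metadata !"rx",    metadata !1 }
```

Two accesses tagged with nodes on different branches of the tree (e.g. `!1` and `!2`, neither an ancestor of the other) are reported as non-aliasing.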
## Progress

## Problems / Optimisations to Solve

### LLVM Optimisations

Some patches have been pushed that implement TBAA and significantly improve the Alias Analysis story. See ticket [#5567](https://gitlab.haskell.org//ghc/ghc/issues/5567) for more information.

Roman reported that running `opt -std-compile-opts` gives much better code than running `opt -O3`.

**Following is from Roman Leshchinskiy**

`-O2 -std-compile-opts` does the trick but it's obviously overkill because it essentially executes the whole optimisation pipeline twice. The crucial passes seem to be loop rotation and loop invariant code motion. These are already executed twice by `-O2` but it seems that they don't have enough information then and that something interesting happens in later passes which allows them to work much better the third time.

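Concretely, the comparison was between invocations along these lines (illustrative commands only; the exact flags, and the minimal pass combination in the last line, are a guess based on the observation above):

```wiki
$ opt -O3 module.ll -o module.bc                     # leaves the loads in the loop
$ opt -O2 -std-compile-opts module.ll -o module.bc   # better code, but ~two full pipelines
$ opt -O2 -loop-rotate -licm module.ll -o module.bc  # hypothetical cheaper alternative
```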
### Safe Loads (speculative load)

We want to allow LLVM to speculatively hoist loads out of conditional blocks. Relevant LLVM source code is here:

- [SimplifyCFG Source Code](http://llvm.org/docs/doxygen/html/SimplifyCFG_8cpp_source.html)
- [`llvm::isSafeToSpeculativelyExecute`](http://llvm.org/docs/doxygen/html/namespacellvm.html#a4899ff634bf732c16dd22ecfdafdea7d)
- [LLVM Mailing List Discussion about 'Safe loads'](http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-January/046958.html)

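The gatekeeper here is `isSafeToSpeculativelyExecute`: SimplifyCFG only hoists an instruction above a branch if this predicate returns true, and for loads it has to be conservative. Roughly (a simplified paraphrase of the load case, not the verbatim LLVM source):

```wiki
// sketch of the load case in llvm::isSafeToSpeculativelyExecute
case Instruction::Load: {
  const LoadInst *LI = cast<LoadInst>(Inst);
  if (LI->isVolatile())
    return false;
  // Only safe if the pointer is known dereferenceable (e.g. an
  // alloca or a global of known size); an arbitrary pointer such
  // as GHC's Sp might trap.
  return LI->getPointerOperand()->isDereferenceablePointer();
}
```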
**Following is from Roman Leshchinskiy**

I've poked around a bit and things are rather complicated. So far I've identified two problems. Here is a small example function:

```wiki
foo as n = loop 0# 0.0##
  where
    loop i x
      | i >=# n   = (# x, I# i #)
      | otherwise = loop (i +# 1#) (x *## indexDoubleArray# as i)
```

This is the interesting C-- bit:

```wiki
saH_ret()
    cb0:
        Hp = Hp + 8;
        if (Hp > I32[BaseReg + 92]) goto cb5;

        if (%MO_S_Ge_W32(R1, I32[Sp + 16])) goto cb9;
        _saO::I32 = R1 + 1;
        F64[Sp + 0] = %MO_F_Mul_W64(F64[Sp + 0],
                                    F64[I32[Sp + 12] + ((R1 << 3) + 8)]);
        R1 = _saO::I32;
        Hp = Hp - 8;
        jump saH_info; // [R1]
```

Look at what `indexDoubleArray#` compiles to: `F64[I32[Sp + 12] + ((R1 << 3) + 8)]`. We would very much like LLVM to hoist the `I32[Sp + 12]` bit (i.e., loading the pointer to the ByteArray data) out of the loop, because that might allow all sorts of wonderful optimisations such as promoting it to a register. But alas, this doesn't happen: LLVM leaves the load in the loop. Why? Because it assumes that the load might fail (for instance, if `Sp` is NULL) and so can't move it past conditionals. We know, of course, that this particular load can't fail and so can be executed speculatively, but there doesn't seem to be a way of communicating this to LLVM.

As a quick experiment, I hacked LLVM to accept "safe" annotations on loads and then manually annotated the LLVM assembly generated by GHC, and that helped quite a bit. I suppose that's the way to go: we'll have to get this into LLVM in some form, and then the backend will have to generate those annotations for loads which can't fail. I assume these are loads through the stack pointer, and perhaps the heap pointer unless we're loading newly allocated memory (those loads can't be moved past heap checks). In any case, the stack pointer is the most important thing. I can also imagine annotating pointers (such as `Sp`) rather than instructions, but that doesn't seem to be the LLVM way and it's also less flexible.

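For illustration, such an annotated load might have looked like this (the `!safe` metadata name is made up here; it only existed in the hacked LLVM build described above):

```wiki
; load through Sp that we assert cannot trap:
%ln1 = load i32* %Sp_Arg, align 4, !safe !0
```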
### GHC Heap Check (case merging)

See bug [#1498](https://gitlab.haskell.org//ghc/ghc/issues/1498).

**Following is from Roman Leshchinskiy**

I investigated heap checks a bit more and it seems to me that it's largely GHC's fault. LLVM does do loop unswitching, which correctly pulls out loop-invariant heap checks, but that happens fairly late in its pipeline and heap checks interfere with optimisations before that.

However, we really shouldn't be generating those heap checks in the first place. Here is a small example loop:

```wiki
foo as n = loop 0# 0.0##
  where
    loop i x
      | i >=# n   = (# (), D# x #)
      | otherwise = loop (i +# 1#) (x *## indexDoubleArray# as i)
```

This is the C-- that GHC generates:

```wiki
sep_ret()
    ceO:
        Hp = Hp + 12;
        if (Hp > I32[BaseReg + 92]) goto ceT;

        if (%MO_S_Ge_W32(R1, I32[Sp + 16])) goto ceX;

        _seA::I32 = R1 + 1;
        F64[Sp + 0] = %MO_F_Mul_W64(F64[Sp + 0], F64[I32[Sp + 12] + ((R1 << 3) + 8)]);
        R1 = _seA::I32;
        Hp = Hp - 12;
        jump sep_info ();
    ceT:
        I32[BaseReg + 112] = 12;
        goto ceR;
    ceR:
        I32[Sp + 8] = sep_info;
        I32[BaseReg + 32] = 131327;
        jump stg_gc_ut ();
    ceX:
        I32[Hp - 8] = GHC.Types.D#_con_info;
        F64[Hp - 4] = F64[Sp + 0];
        I32[Sp + 16] = Hp - 7;
        R1 = GHC.Unit.()_closure+1;
        Sp = Sp + 16;
        jump (I32[Sp + 4]) ();
```

Note how in each loop iteration we add 12 to `Hp`, then do the heap check, and then subtract 12 from `Hp` again. I really don't think we should be generating that and then relying on LLVM to optimise it away.

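Distilled, the loop path of the code above does the following (same instructions, comments added; `I32[BaseReg + 92]` is presumably `HpLim`):

```wiki
Hp = Hp + 12;                          // reserve worst-case space up front
if (Hp > I32[BaseReg + 92]) goto ceT;  // heap check against the heap limit
...                                    // non-allocating loop path
Hp = Hp - 12;                          // nothing was allocated: give it back
jump sep_info ();                      // next iteration
```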
This happens because GHC commons up heap checks for case alternatives and does just one check before evaluating the case. The relevant comment from `CgCase.lhs` is this:

A more interesting situation is this:

```wiki
!A!;
...A...
case x# of
  0#      -> !B!; ...B...
  default -> !C!; ...C...
```

where `!x!` indicates a possible heap-check point. The heap checks in the alternatives **can** be omitted, in which case the topmost heap check will take their worst case into account.

This certainly makes sense if A allocates. But with vector-based code at least, a lot of the time neither A nor C will allocate **and** C will tail-call A again, so by pushing the heap check into !A! we are now doing it **in** the loop rather than at the end.

It seems to me that we should only do this if A actually allocates and leave the heap checks in the alternatives if it doesn't (perhaps we could also use a common heap check if **all** alternatives allocate). I tried to hack this and see what happens but found the code in `CgCase` and friends largely incomprehensible. What would I have to change to implement this (perhaps controlled by a command line flag), and is it a good idea at all?