Simon Peyton Jones · c9c37118
--- a/commentary/compiler/loopification.md
+++ b/commentary/compiler/loopification.md
@@ -13,3 +13,191 @@ Here is a summary of relevant links and tickets
  - [\#8285](https://gitlab.haskell.org//ghc/ghc/issues/8285)
  - [\#8793](https://gitlab.haskell.org//ghc/ghc/issues/8793), [\#11372](https://gitlab.haskell.org//ghc/ghc/issues/11372); see comment 15 of [\#8793](https://gitlab.haskell.org//ghc/ghc/issues/8793)) etc, where it seems that we are missing loopification for a simple IO function
  - [\#8585](https://gitlab.haskell.org//ghc/ghc/issues/8585) concerned getting the loop to start *after* the stack check
+
+## Current implementation (Apr 17)
+
+
+Loopification was implemented by Jan Stolarek:
+
+```wiki
+commit d61c3ac186c94021c851f7a2a6d20631e35fc1ba
+Author: Jan Stolarek <jan.stolarek@p.lodz.pl>
+Date:   Thu Aug 29 10:57:04 2013 +0100
+
+    Optimize self-recursive tail calls
+
+    This patch implements loopification optimization. It was described
+    in "Low-level code optimisations in the Glasgow Haskell Compiler" by
+    Krzysztof Woś, but we use a different approach here. Krzysztof's
+    approach was to perform optimization as a Cmm-to-Cmm pass. Our
+    approach is to generate properly optimized tail calls in the code
+    generator, which saves us the trouble of processing Cmm. This idea
+    was proposed by Simon Marlow. Implementation details are explained
+    in Note [Self-recursive tail calls].
+
+    Performance of most nofib benchmarks is not affected. There are
+    some benchmarks that show 5-7% improvement, with an average improvement
+    of 2.6%. It would require some further investigation to check if this
+    is related to benchamrking noise or does this optimization really
+    help make some class of programs faster.
+
+    As a minor cleanup, this patch renames forkProc to forkLneBody.
+    It also moves some data declarations from StgCmmMonad to
+    StgCmmClosure, because they are needed there and it seems that
+    StgCmmClosure is on top of the whole StgCmm* hierarchy.
+```
+
+
+It's not a massive patch (580 lines of diff).
+
+
+There is a succession of small follow-up patches. It's controlled by `-loopification`, which on by default with `-O` and `-O2`.
+
+## New idea: use join points
+
+
+The basic idea is this: instead of doing loopification in the
+code generator, express it directly in Core using join points.
+
+
+Consider
+
+```wiki
+  f = \x. letrec g n y = if n>x then y
+                                  else g (n+1) (y+n)
+          in g 0 0
+```
+
+
+GHC will spot g as a join point, and we'll get a nice tight loop with a direct jump.
+No need for the loopification patch!
+
+
+But suppose the program was instead
+
+```wiki
+  f = \x xs. letrec g n y = if n>x then y
+                                   else g (n+1) (y+n)
+             in map (g 0) xs
+```
+
+
+Now g is not tail called, so GHC won't turn it ito a join point.
+We'll get a heap allocated closure for g (unless we lambda lift it, which is a separate matter), and things are Not So Good.
+
+
+Idea: transform the program thus:
+
+```wiki
+  f = \x xs. let g n y = joinrec gj n y = if n>x then y
+                                                 else gj (n+1) (y+n)
+                         in gj n y
+             in map (g 0) xs
+```
+
+
+Now we can (and in fact do) generate a nice loop:
+
+```wiki
+f = \ (x_aSH :: Int) (xs_aSI :: [Int]) ->
+      map @ Int @ Int
+        (\ (y_aSL :: Int) ->
+           case y_aSL of { GHC.Types.I# ww1_s2aH ->
+           joinrec {
+             $wgj_s2aJ [InlPrag=[0], Occ=LoopBreaker]
+               :: GHC.Prim.Int# -> GHC.Prim.Int# -> Int
+             [LclId[JoinId(2)], Arity=2, Str=<S,U><S,U>m, Unf=OtherCon []]
+             $wgj_s2aJ (ww2_s2aD :: GHC.Prim.Int#) (ww3_X2aY :: GHC.Prim.Int#)
+               = case x_aSH of { GHC.Types.I# y1_a27U ->
+                 case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.># ww2_s2aD y1_a27U) of {
+                   False ->
+                     jump $wgj_s2aJ
+                       (GHC.Prim.+# ww2_s2aD 1#) (GHC.Prim.+# ww3_X2aY ww2_s2aD);
+                   True -> GHC.Types.I# ww3_X2aY
+                 }
+                 }; } in
+           jump $wgj_s2aJ 0# ww1_s2aH
+           })
+        xs_aSI
+```
+
+### The main transformation
+
+
+The main transformation is this.  If we have a recursive binding
+
+```wiki
+  rec { f x = ...f... }
+```
+
+
+where the recursive calls to `f` are all saturated tail calls, then replace the binding with
+
+```wiki
+  nonrec { f x = join
+                   f x = ...f...
+                 in f x }
+```
+
+
+(We are using shadowing here, to avoid making up a new variable name.)
+
+### Top level
+
+
+We can do this at top level too.
+
+```wiki
+   g n y = if n>x then y
+                  else g (n+1) (y+n)
+===>
+   g n y = joinrec gj n y = if n>x then y
+                                   else gj (n+1) (y+n)
+```
+
+
+Now when g is called it'll run in a tight local loop with good code generation.
+This is definitely better.  Again, it means that we don't need the loopifiation patch.
+
+**Side note**: if you try this today, you'll actually find that GHC
+
+- Spots the join point
+- But then full laziness floats it to top level so it's not a join point.
+
+
+We should stop making full laziness float join points!
+
+### Mutual recursion
+
+
+All this works for mutually recursive join points.  (The loopification patch does not.)
+
+```wiki
+letrec
+   f x = ....g e1....
+   g y = ....f e2...
+in
+   ...f...
+
+===>
+
+letrec
+   f x = joinrec fj x = ....gj e1....
+                 gj y = ....fj e2....
+         in fj x
+in
+   ...f...
+```
+
+
+(assuming of course that the uses of f,g in their own RHSs
+are all saturated tail calls).
+
+
+The body of the letrec must use only *one* of the functions
+in the recursive group (although it does not need to use
+it in a join-pointy way).  I can't see how to make this work
+if both `f` and `g` are used.
+
+
+At top level, a "use" includes exports.