Late lambda lifting
Overview
Late Lambda Lifting is a notverywellexplored optimisation for GHC. The basic idea is to transform
f x = let g y = ...y...x...
in ...(g v1)...(g v2)...
===>
g' x y = ...y...x...
g x = ...(g' x v1)...(g' x v2)...
That is, pretty standard lambdalifting. The idea is that instead of allocating a closure for g
, we pass extra parameter(s) to it. More below, under "Background".
Since heap allocation is expensive, this has the possibility of making programs run faster. The challenge is all about getting consistent speedups.
There is a ticket to track progress: #9476 (closed). There's a paper in the making at https://github.com/sgraf812/latelamlift/blob/master/paper.pdf.
Tickets
See the late lambda lifting label.
Implementation
The implementation is spread over three modules:
StgLiftLams
├── Analysis
├── LiftM
└── Transformation
Transformation
contains the actual lambdalifting implementation on STG
programs. For the decision of whether or not to lift a binding, it calls out
into Analysis.goodToLift
. The transformation produces and ultimately runs
effects in LiftM
, which nicely encapsulates nitty gritty details such as
dealing with substitution, conjuring fresh binders, cost centre stacks and
caffyness.
Flags
Late lambda lifting is configurable via these flags:
data GeneralFlag =
...
 Opt_StgLiftLam  ^ Enable late lambda lifting.
 fstgliftlams  Enabled by default in O*.
...
data DynFlags = DynFlags {
...
liftLamsRecArgs :: Maybe Int,  ^ Maximum number of arguments after lambda lifting a
 recursive function.
 fstgliftlamsrecargs
liftLamsNonRecArgs :: Maybe Int,  ^ Maximum number of arguments after lambda lifting a
 nonrecursive function.
 fstgliftlamsnonrecargs
liftLamsKnown :: Bool,  ^ Lambda lift even when this turns a known call
 into an unknown call.
 fstgliftlamsknown
...
}
Short wrapup on history
The original work by Nicolas Frisby started out in 2013 as part of his internship on the wip/llf
branch.
Sebastian Graf has rebased this branch in mid April 2018. After some debugging
and fixups, it passed ./validate
(modulo some compiler perf tests) in
c1f16ac.
The most recent variant of Nicolas' original Core transformation can be found
here: https://github.com/sgraf812/ghc/tree/llf.
In July 2018, Sebastian argued that it's probably a good idea to reimplement the transformation on STG instead of Core, the promising implementation of which is available here.
If you want to know more about the original work on Core, look into the history of this Wiki page (before October 2018).
Background
The primary objective of late lambda lifting is to replace heap allocation of closures by an increase in the number of arguments passed, by floating local functions to toplevel.
The following examples aren't strictly in STG, so assume that all nontrivial arguments are given a name and all closures are properly annotated with their free variables.
Consider lambdalifting f
out of before
.
before a b =
let f x = RHS[f, b, x]
in BODY[f, a, b]
Lifting f
out of before
would yield:
$lf b x = RHS[$lf b, b, x]
after a b = BODY[$lf b, a, b]
Note that all occurrences of f
in RHS
and BODY
have been
replaced by the entire expression.
before1
(an instance of before
) is the basic motivating case for
the LLF.
before1 a b c =
let f1 x y = (b + x) + (c + y)
in f1 a b * f1 c a
$lf1 b c x y = (b + x) + (c + y)
after1 a b c =
$lf1 b c a b * $lf1 b c c a
In this case, after1
allocates less on the heap than does
before1
. before1
allocates a closure for f1
and then enters it
twice. after1
does no such allocation, but it does pass more arguments
when entering the statically allocated $f1
. If before1
is called
numerous times, then this transformation could drastically reduce the
overall program allocation. The runtime effect of the additional
arguments may be negligible.
When to lift
The considerations of this section can be found in StgLiftLams.Analysis
.
Why lambdalift late?
We lift late in the pipeline primarily because lambdalifting is known to drastically interfere with many of the other optimizations in the pipeline.
Ultimately, we lambdalift at the end of the STG pipeline, just before code gen. In fact, we like to think about LLF as a code gen strategy that turns closures into extra arguments when doing so is viable.
Syntactic Consequences of Lifting
Lifting a dynamic function closure has four immediate consequences.

C1 It eliminates the let.

C2 It creates a new toplevel definition.

C3 It replaces all occurrences of the lifted function in the let's body with a (partial) application. E.g., all occurences of
f1
were replaced by$lf1 b
. 
C4 All nontoplevel variables that occured in the let's RHS become occurrences of parameters.
Operational Consequences of Lifting
The resulting operational consequences can be good or bad, both statically and
dynamically. These consequences finally lead to a number of heuristics,
implemented in StgLiftLams.Analysis.goodToLift
.

C1 reduces heap allocation, unless the function is letnoescape (LNE), which don't allocate to begin with. Consequently, we don't lift LNEs.

C3 might increase heap allocation and runtime by creating PAPs if the lifted function was previously a function argument. While this won't always increase allocation, it certainly will only improve allocations if this pushes allocation away from the hot path into some cold leaf. We don't allow this, because it also significantly complicates the transformation for arguably very questionable gain.

C3 might increase or decrease heap allocation. If
f
occurs in a closure in the let body, then
it can increase heap allocation, if the additional arguments to
f
did not previously occur in that closure 
it can decrease heap allocation, if those additional arguments did already occur in the closure.
We call the sum result of A and B closure growth. Estimating closure growth is paramount to achieving unambiguously good results. See "Closure Growth" below.


C4 might increase stack allocation and/or increase register pressure, depending on how many arguments the new toplevel function function would take. We determined 5 extra arguments (the number of available free argument registers on x8664) to be a sweet spot for cut off.

C4 might cause a slowdown because it can convert known calls into unknown calls when the lifted function previously closed over a function. We don't perform the lift in this case.
Finally, we obviously only lift functions, no closures or constructor applications.
Closure Growth
Both A and B above could hold within the same let body. Moreover, A or B could hold wrt a closure that is under a (multishot) lambda; so that the changes in heap allocation could happen 0 or billions of times. The conservative choice, then, is that A under a lambda infinitely increases the heap allocation and B under a lambda is ignored.
We look at demand information for a more precise, yet still conservative estimate: Usage information can be harnessed to identify oneshot bodies for A under lambdas, while strictness information can be used to propagate the benefits of B under lambdas.
StgLiftLams.Analysis.closureGrowth
estimates closure growth, operating
on a skeleton of the actual STG tree. Skeletons are an abstraction, only
capturing the relevant syntactic constructs. E.g., closures with their fvs,
RHSs with relative cardinality, sequential composition, choice and a neutral
element.
This is so that not every lifting decision has to traverse the entire syntax
tree, while also keeping closureGrowth
plain and simple. To this purpose, each
let binder of the STG syntax tree is tagged with the Skeleton of its RHS.
From the resulting closure growth, the effects of C1 are subtracted to determine whether the lift would be beneficial.