Making LetUp more precise
The demand analyser uses two different rules for handling let bindings, as explained in the cardinality paper. TLDR:
- LetDown is used for top-level bindings, recursive bindings and local, non-recursive functions. It closely mimics semantics of an actual function call, as it unleashes the demand signature of the bound RHS at the call site within the let body.
- LetUp is used for local, non-recursive thunks. This works more like a type-system: Analysing the let-body reveals its demand on the identifier, in which the RHS of the bindings is analysed exactly once. Only now it is that the results from the body and the RHS get sequentially combined.
There are reasons why we currently need both rules (cardinality paper §3.5 and §3.6), although the general approach of LetDown is more flow-sensitive and thus strictly more precise than LetUp.
Consider the following running example:
let y = ...
in let x = 2*y
in if ... then x else y
This evaluates y
only once, but the LetUp rule (in contrast to LetDown) fails to recognise this:
The inner let-body has a demand environment of [x :-> <L,1*U>, y :-> <L, 1*U>]
, which combines with the DmdEnv of the RHS [y :-> <L, 1*U>
to a total DmdEnv for the let-expression binding x
of [y :-> <L,U>]
.
E.g., y
is possibly evaluated multiple times!
Co-call graphs
Meanwhile, Call Arity, using a LetUp strategy at let-bindings exclusively, has countered the exact same loss of precision by employing Co-call graphs. Instead of having cardinality information stored in a plain identifier map, where identifiers are assumed to be evaluated with each other, co-call graphs additionally assert that some ids are never evaluated during the same evaluation of the containing expression.
For the above example, the co-call graph for the inner let-body would specifically not contain an edge between x
and y
, because they are evaluated on mutually exclusive code paths. Thus, substituting the co-call graph of x
s RHS into the co-call graph of the let-body will not induce a loop on y
, so y will be recognised as evaluated at most once, which is what we want here.
BOTH/OR trees
The idea behind co-call graphs translates to demand analysis by being as lazy as possible in computing DmdEnv
s.
In particular, bothDmdType
and lubDmdType
currently smash together both incoming DmdEnv
s. This forgets important structure for the above example: Namely that x
and y
are evaluated mutually exclusively.
To remedy that, we could model DmdEnv
s as a tree structure (leaving Termination
-based nastiness aside for a moment):
data DmdEnv'
= Empty
| Var Id Demand
| Or DmdEnv' DmdEnv'
| Both DmdEnv' DmdEnv'
flattenDmdEnv' :: DmdEnv' -> DmdEnv
flattenDmdEnv' env =
case env of
Empty -> emptyVarEnv
Var id dmd -> unitVarEnv id dmd
Or env1 env2 -> lubDmdEnv env1 env2
Both env1 env2 -> bothDmdEnv env1 env2
This is essentially the interpreter pattern, with flattenDmdEnv'
being an interpreter in terms of old DmdEnv
.
However, as we'll see in a second, this buys us the ability to perform proper substitution when resolving let-bindings with LetUp, instead of just deleting the bound id from the body env and both
ing with the result of the RHS.
For the above example, we get (Both (...) (Or (Var x <S,1*U>) (Var y <S,1*U>)))
as the DmdEnv'
of the let-body and (Var y <L,1*U>)
as the DmdEnv'
of x
s RHS under the recorded demand <L,1*U>
on x
within the body.
We can now substitute the DmdEnv'
of x
s RHS into all occurences of x
within the DmdEnv'
of the let-body:
(Both (...) (Or env (Var y <S,1*U>)))[env := (Var y <L,1*U>)]
= (Both (...) (Or (Var y <L,1*U>) (Var y <S,1*U>)))
If we look up the demand on y
in the resulting DmdEnv'
, we find that y
is put under demand <L,1*U>
, so used at most once, which is what we wanted.
Note however that it is still not detected as being used strictly!
For this, we would need to analyse x
s RHS under the demand of the use site we substitute the DmdEnv'
into, much like LetDown would.
Let's revisit this later.
Thunk Sharing
There's an(other) issue with the substitution model.
Substitution appeals because it approximates LetDown in terms of precision, but it also suffers from the same imprecision when it comes to modeling thunk sharing:
let y = ..
in let x = 2*y
in x + x
Note that although x
is evaluated twice, the evaluation of its RHS is shared, so that y
is evaluated at most once.
This is also what the current demand analysis finds out.
Let's look at the DmdEnv'
for the inner let-body:
(Both
(Var x <S,1*U>)
(Var x <S,1*U>))
Substituting the DmdEnv'
of x
s RHS:
(Both
(Var y <L,1*U>)
(Var y <L,1*U>))
Ouch! Substitution forgets about sharing and let's us see an imprecise upper bound of w*U
.
What we are lacking is some kind of model for the runtime heap.
How does LetUp solve this?
Well, it operates under the simple assumption that the binding is evaluated exactly once in the entire body, if it's evaluated at all.
So, it's OK to both
the DmdEnv'
of the RHS to that of the let body at the root. This won't destroy sharing!
Now, for the first example (if ... then x else y
), it is quite obvious that the bound id x
doesn't even occur in the first Or
branch of the DmdEnv'
of the let body (after deleting x
from it).
Only the second branch evaluates x
at all, so it should be safe to 'push' the graft point, where we graft the DmdEnv'
of the RHS of x
onto the DmdEnv'
of the let body, down to the ''most recent common ancestor'' (MCRA) of all occurences of x
in the body.
For the first example:
-- [] marks the MCRA of all previous occurences of `x`, which were already deleted
graft (\t -> (Both t (Var y <L,1*U>))) onto (Both (...) (Or [Empty] (Var y <S,1*U>))))
= (Both (...) (Or (Both Empty (Var y <L,1*U>)) (Var y <S,1*U>))))
Multiple graft points
Grafting the DmdEnv'
of the RHS of x
at the MCRA of all occurences of x
in the DmdEnv'
of the body is AFAICT the only solution that doesn't duplicate the DmdEnv'
of the RHS into the tree, so should be quite predictable from a performance perspective.
However, there are shortcomings, exposed by examples like this:
let y = ... in
let x = 2*y in
if ... then
x
else if ... then
y
else if ... then
x
(Both
(...)
(Or
(Var x <S,1*U>)
(Both
(...)
(Or
(Var y <S,1*U>)
(Var x <S,1*U>)))))
The MCRA of all occurences of x
is the root of the DmdEnv'
. That's unfortunate, as this means we aren't more precise than LetUp here, which is insufficient to realise that y
is only evaluated at most once.
This tension can only be resolved by having multiple graft points, one at each occurence of x
.
When is this safe to do?
It's always safe to push down a graft point when there's only one child node in which x
occured (that's why it's safe to choose the MCRA).
For the case when x
occured in both children (can only be Both
or Or
):
- It's safe to push graft points down both branches of an
Or
, as this can't destroy sharing. The only gripe is that the tree grows linear in the number ofOr
nodes with this property compared to the MCRA or plain LetUp approach. - It's not generally safe push graft points down both branches of a
Both
:(Both (Var x <S,1*U>) (Var x <S,1*U>))
was an earlier example that proofs that.
So, by additionally pushing down graft points along both branches of an Or
node, if needed, we can handle cases like the above.
We buy increased precision by possibly super-linear space complexity.
Both
over Or
Distributing There's one additional trick we can apply for ... a simpler model of the DmdEnv'
?
In order to arrive at a disjunctive normal form-like model for DmdEnv'
, we can apply the distributive law (Both env1 (Or env2 env3) = (Or (Both env1 env2) (Both env1 env3))
.
This allows for the following DmdEnv'
:
newtype DmdEnv'' = DNF [DmdEnv]
flattenDmdEnv'' :: DmdEnv'' -> DmdEnv
flattenDmdEnv'' (DNF envs) =
lubDmdEnvs envs
dnf :: DmdEnv' -> DmdEnv''
dnf = DNF . go
where
go env =
case env of
Empty -> [emptyDmdEnv]
Var id dmd -> [unitDmdEnv id dmd]
Or env1 env2 -> go env1 ++ go env2
Both env1 env2 -> concatMap (\env1 -> map (bothDmdEnv env1) (go env2)) (go env1) -- ugh
dnf
shows the biggest drawback of this: This is pretty much exponential in the number Both
nodes.
'Substitution' is pretty easy here: Because there are no Both
nodes (or rather that the top-level is Or
only), we can just substitute into each other.
sgraf: Actually, I'm not sure what this buys us, other than simplicity of implementation. And performance-wise, I have a strong feeling that this blows up pretty fast. Effectively, we model every possible path of execution separately. It's the same as in logic, I guess: DNF is always possible, and easily handled, but conversion blows up exponentially in general.
Termination
The previous ideas greatly simplify by ignoring Termation
/DmdResult
s.
In fact, the bugs encountered while implementing the first proposal were all due to complex interplay with Termination
.
Consider the two DmdType
s {}
(e.g. empty, terminating) and {y-><S,1*U>}x
(uses y
exactly once and throws, like in error y
).
When these are lub
ed, we want the DmdType
{y-><L,1*U>}
(uses y
at most once, possibly terminating) as a result.
The status quo is to compute the new DmdEnv
by plusVarEnv_CD lubDmd env1 (defaultDmd res1) env2 (defaultDmd res2)
.
- Where *either* the left or right
env*
is defined, we compute a new entry bylub
ing, usingdefaultDmd res*
(whereres*
is theDmdResult
of the correspondingDmdType
) when there is no entry in one of theenv*
s. - When there is no entry in either
env*
, then there won't be in the result!
Since this approach gets its precision from delaying lub
s as long as possible, we have to store Termination
information in the DmdEnv'
/DmdEnv''
data structure.
At the least, we have to tag the children of Or
nodes, for the first two approaches (DmdEnv'
) we also have to tag Both
nodes, e.g.
data DmdEnv'
= Empty
| Var Id Demand
| Or DmdEnv' (Termination ()) DmdEnv' (Termination ())
| Both DmdEnv' (Termination ()) DmdEnv' (Termination ())
newtype DmdEnv''
= DNF [(DmdEnv, Termination ())] -- Not quite, see below
Variables not mentioned in either lub
operand are assumed to be used according to defaultDmd res
, where res
is the DmdResult
currently part of the containing DmdType
.
Note that due to ExnStr
/catch#
, the defaultDmd res
can change after computing the actual lub
above, so in general defaultDmd res*
and defaultDmd res
can vary independently!
That's also why [findIdDemand
](https://github.com/sgraf812/ghc/blob/922db3dac896b8cf364c9ebaebf1a27c2468c709/compiler/basicTypes/Demand.hs#L1577) in Demand.hs
takes a DmdResult
as argument.
And that's also why the above definition for DmdEnv''
is incorrect!
It assumes that a single Termination
result per lub
operand is enough, which isn't the case, as the following continued example shows:
We started with lubDmdType {} {y-><S,1*U>}x = {y-><L,1*U>}
(in terms of the status quo).
Now, lub
that again with {z-><S,1*U>}
to get {y-><L,1*U>,z-><L,1*U>}
.
So far so good.
The DNF would look something like [({},Dunno ()), ({y-><S,1*U>},ThrowsExn), ({z-><S,1*U>},Dunno ())]
, which flattenDmdEnv''
s to pretty much the same DmdEnv
as in the DmdType
above.
Now consider what happens when Termination
changed in-between the two lub
s, e.g. because the first result of lubDmdType
was an argument to error
:
lubDmdType (markThrowsExn (lubDmdType {} {y-><S,1*U>}x)) {z-><S,1*U>}
= lubDmdType (markThrowsExn {y-><L,1*U>} ) {z-><S,1*U>}
= lubDmdType {y-><L,1*U>}x {z-><S,1*U>}
= {y-><L,1*U>,z-><S,1*U>}
Note that, different to before, z
is now evaluated strictly!
In our current DNF formulation, we wouldn't be able to express the difference.
The problem here is that lub
is no longer associative, because it depends on the Termination
at the time of the call.
So, we can't actually model DmdEnv''
as a DNF, but have to resort to
data DmdEnv''
= Lit DmdEnv
| Or DmdEnv'' (Termination ()) DmdEnv'' (Termination ())
Which is only marginally less complex than a version of DmdEnv'
where we subsume the Empty
and Var
cases with an equivalent Lit
case:
data DmdEnv'
= Lit DmdEnv
| Or DmdEnv' (Termination ()) DmdEnv' (Termination ())
| Both DmdEnv' (Termination ()) DmdEnv' (Termination ())
dnf :: DmdEnv' -> DmdEnv''
dnf (DmdEnv'.Lit env) = DmdEnv''.Lit env -- hypothetical disambiguating syntax
dnf (DmdEnv'.Or fv1 t1 fv2 t2) = DmdEnv''.Or (dnf fv1) t1 (dnf fv2) t2
dnf (DmdEnv'.Both fv1 t1 fv2 t2) = distr (dnf fv1) t1 (dnf fv2) t2 -- case in point
where
-- both env1 with defaultDmd t1 into every Lit of fv2
distr (DmdEnv''.Lit env1) t1 fv2 t2 = undefined
-- both env2 with defaultDmd t2 into every Lit of fv1
distr fv1 t1 (DmdEnv''.Lit env2) t2 = undefined
-- Not exactly sure how to compute this.
-- This case leads to the exponential explosion,
-- as we have to combine every Lit in fv1 with every Lit in fv2.
-- This gets insane as we have to keep in mind which `Termination` to use for each point
-- in each `Lit` separately.
distr DmdEnv''.Or{} t1 DmdEnv''.Or{} t2 = undefined
dnf
mirrors how bothDmdEnv'
would have to change in the current implementation to get to the DNF-like version.
In fact, the currently working prototype of the first proposal had a complicated bug in the `lookupDmdTree` implementation, which forced me to think about how the lookup of a single point can potentially access *all* Termination ()
tags on the path to a Lit
it is contained in.
This also applies to the implementation of dnf
: I'm pretty sure the step that distributes Both
over both sub trees needs to model all points separately to know with which Termination
to both
.
Sorry if this is a little incomprehensible without an example, but this is already pretty deep down the rabbit hole.
The bottom-line is that there still is some DNF-like structure possible, but in addition to the incurred exponential space complexity, there's also the issue of how to distribute Both
nodes into leafs of the Or-tree without losing sanity to the implementation.
Evaluation
Both/Or trees, grafted at MCRAs
The current working prototype of the first proposal (progress tracked at `and-or` branch) passes ./validate
and nofib.
GHC is built with 0.1% more allocations and 7.1% (!) more executed instructions (cachegrind).
There may still be potential in optimizing the DmdTree
(called DmdEnv'
on this page) representation in its smart constructors (bothDmdTree
, lubDmdTree
) to avoid big trees.
Running nofib revealed two benchmarks where allocation changed, fft2
with -0.1% and transform
with -0.0%.
Changes in counted instructions where all around +-0.0%, with the exception of fft2
(-0.2%).
Seems like a net gain, but pretty much insignificant, especially compared to the significant increase in compilation time.
Both/Or trees, pushed into Or
The implementation of the second proposal can be found here (progress tracked at `and-or-distr` branch. This was a minimal change compared to the single graft point solution, with great repercussions:
Compilation of modules with big records (looking at you, `Distribution.Types.BuildInfo` eats up all resources on my machine without coming to an end.
Given the minimal invasive change, this can only be due to combinatorial explosion.
Some debugging pinned this down to derived Eq
and Show
instances, where the generated DmdTree
s grow bigger than e.g. 10000 nodes.
I'm not entirely sure what causes the duplication that leads to the slowdown, so I tried falling back to the status quo for dictionary Ids (like we do for unlifted lets anyway) to no avail.
DNF
When trying to implement this, I stumbled on the problems wrt. Termination
outlined above.
Also considering that the space complexity of this approach will be even less predictable, I've put the implementation of this is on ice for now.