Partial redundancy elimination and LiberateCase
See also
I've for a long time wondered what Partial Redundancy Elimination (a generalisation of common subexpression elimination) would look like in Core.
What is Partial redundancy elimination?
According to https://link.springer.com/chapter/10.1007/978-3-540-24723-4_12,
Partial redundancy elimination (PRE) is a program transformation that identifies and eliminates expressions that are redundant on at least one (but not necessarily all) execution paths.
Here's an example imperative CFG based on an SSA IR:
┌─────┐
│Start│
┌───┴─────┴──────┐
│ │
┌───────▼──┐ ┌─────▼────┐
│ │1 │x=load [a]│2
└────┬─────┴───┐ └─────┬────┘
│ │ │
│ ┌─────▼────┐ │
│ │ │3 │
│ └─────┬────┘ │
│ │ │
┌────▼─────┐ │ ┌─────▼────┐
│ │4 └───►y=load [a]│5
└────┬─────┘ └─────┬────┘
│ │
│ ┌─────┐ │
└──────► End ◄──────┘
└─────┘
Note that we can't remove the load in basic block 5 because it is not fully redundant with the load in 2: 2 does not dominate 5 and indeed on the path Start->1->3->5->End
the load in 5 is not redundant.
But on path Start->2->5->End
, we load from a
twice, hence the load in 5 is partially redundant there.
Partial redundancy elimination spots this situation and inserts a load in 3, making the load in 5 fully redundant. Now it can be CSE'd/replaced with a phi:
┌─────┐
│Start│
┌───┴─────┴──────┐
│ │
┌───────▼──┐ ┌─────▼────┐
│ │1 │x=load [a]│2
└────┬─────┴───┐ └─────┬────┘
│ │ │
│ ┌─────▼────┐ │
│ │z=load [a]│3 │
│ └─────┬────┘ │
│ │ │
┌────▼─────┐ │ ┌─────▼────┐
│ │4 └───►y=phi(z,x)│5
└────┬─────┘ └─────┬────┘
│ │
│ ┌─────┐ │
└──────► End ◄──────┘
└─────┘
And now there is only ever one load of a
on every path through the CFG.
This can make a huge difference for loops! Imagine a partially redundant load in a loop body. Then it is good to duplicate the loop header (the conditional) and make the load fully redundant. We have basically recovered loop-invariant code motion.
What would PRE look like in Core?
First we should think about what kind of code we might want to make fully redundant. LiberateCase says that case .. of ...
is a good candidate not to do inside a loop, just like a load
in an imperative language. A slightly tweaked example from The liberate-case transformation
:
f = \ t -> if cond
then []
else case v of
V a b -> a : f t
=> the inner f is replaced.
f = \ t -> if cond
then []
else case v of
V a b -> a : (letrec
f = \ t -> case v of
V a b -> a : f t
in f) t
(note the NEED for shadowing)
Note that LiberateCase would duplicate the whole body of f
at every call site! That's wasteful and one reason it is only active in -O2.
Inspired by PRE, instead we could just duplicate the "loop header conditional" if cond then ... else ...
, extracting the then
and else
parts into separate bindings if necessary (I didn't do so for the then
branch here):
f = \t -> letrec {
loop t = case v of
V a b -> a : test t in
test t = if cond then [] else loop t
} in test t
And now loop
is strict in v
(the case v of ...
has become fully redundant). Unfortunately, this is still no better because we don't know where to put the code case v of ...
.
So we inline test
into loop
, once
f = \t -> letrec {
loop t = case v of
V a b -> a : if cond then [] else loop t in
test t = if cond then [] else loop t
} in test t
and then float loop
inside test
:
f = \t -> let test t = if cond then [] else
(letrec loop t = case v of
V a b -> a : if cond then [] else loop t
in loop t)
in test t
And now the case v of ...
can float otuside loop
, because it is used strictly in the letrec. Hence fully redundant in the else
branch, where we put it:
f = \t -> let test t = if cond then [] else
case v of V a b ->
letrec loop t = a : if cond then [] else loop t
in loop t
in test t
(In this case, we could even inline test
unconditionally. At the very least it could become a join point.)
And now we have really good code. Compared to LiberateCase, we didn't need to duplicate the whole body of f
multiple times, we only needed to duplicate the loop header test
(of course, what constitutes a loop header and how big it may become compared to the whole loop body is up for debate).
I proposed a generalisation of LiberateCase inspired by PRE that alleviates LiberateCase's biggest drawback: The potential code size increase. Perhaps that would be enough to do it in -O1.
Taking inspiration in the literature on PRE could even lead to more cases in which it applies, not sure.
I don't plan to work on this, but I needed some place to write this down before I page it out again and have to reconstruct it the next time it's bothering me. Perhaps it's also a nice bachelor's thesis.