GHC issueshttps://gitlab.haskell.org/ghc/ghc/-/issues2023-03-01T08:33:26Zhttps://gitlab.haskell.org/ghc/ghc/-/issues/23029Late lambda lifting for free variables shared inside recursive groups.2023-03-01T08:33:26ZAndreas KlebingerLate lambda lifting for free variables shared inside recursive groups.This is mostly a question for @sgraf812 but seems fitting here as others might have opinions/knowledge on this as well.
The basic scenario is that we have something like:
```haskell
Rec {
foo {fv1,fvs_foo} args = ... foo args ... bar...This is mostly a question for @sgraf812 but seems fitting here as others might have opinions/knowledge on this as well.
The basic scenario is that we have something like:
```haskell
Rec {
foo {fv1,fvs_foo} args = ... foo args ... bar args ...
bar {fv1,fvs_bar} args = ... foo args ... bar args ...
} in
...
let my_fav_thk {fv1} = ... foo args ...
```
Does late lambda lifting currently support transforming this into something like:
```haskell
Rec {
foo {fvs_foo} fv1 args = ... foo fv1 args ... bar fv1 args ...
bar {fvs_bar} fv1 args = ... foo fv1 args ... bar fv1 args ...
} in
...
let my_fav_thk {fv1} = ... foo fv1 args ...
```
I'm currently looking at a codebase where we could save a lot of memory this way. But it doesn't seem to happen and I found no obvious knob to make it happen.https://gitlab.haskell.org/ghc/ghc/-/issues/22659A Tree fold does not optimize very well2024-02-09T09:11:55ZmeooowA Tree fold does not optimize very well## Summary
The following fold function on a rose tree does not optimize very well.
```hs
data Tree a = Node a [Tree a]
foldrTree :: (a -> b -> b) -> b -> Tree a -> b
foldrTree f z t = go t z
where go (Node x ts) z' = f x (foldr go z...## Summary
The following fold function on a rose tree does not optimize very well.
```hs
data Tree a = Node a [Tree a]
foldrTree :: (a -> b -> b) -> b -> Tree a -> b
foldrTree f z t = go t z
where go (Node x ts) z' = f x (foldr go z' ts)
```
But some more complex variations of it do. An example with benchmarks on -O2:
```hs
import Control.DeepSeq
import Criterion.Main
main :: IO ()
main = defaultMain
[ env (pure t_) $ \t -> bgroup ""
[ bench "foldrTree" $ whnf (foldrTree (&&) True) t
, bench "foldrTreeAlt" $ whnf (foldrTreeAlt (&&) True) t
]
]
where
t_ = Node True [Node True [] | _ <- [1..1000000]]
data Tree a = Node a [Tree a]
instance NFData a => NFData (Tree a) where
rnf (Node x ts) = rnf x `seq` rnf ts
foldrTree :: (a -> b -> b) -> b -> Tree a -> b
foldrTree f z t = go t z
where go (Node x ts) z' = f x (foldr go z' ts)
foldrTreeAlt :: (a -> b -> b) -> b -> Tree a -> b
foldrTreeAlt f z t = go t z
where go (Node x ts) = f x . foldr ((.) . go) id ts
```
```
benchmarking foldrTree
time 11.60 ms (11.50 ms .. 11.70 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 11.61 ms (11.58 ms .. 11.65 ms)
std dev 93.16 μs (76.81 μs .. 111.6 μs)
benchmarking foldrTreeAlt
time 5.890 ms (5.869 ms .. 5.910 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 5.943 ms (5.928 ms .. 5.969 ms)
std dev 57.94 μs (43.70 μs .. 88.66 μs)
```
For background, this was encountered in https://github.com/haskell/containers/pull/880
## Expected behavior
The simpler function should ideally optimize as well as the other one.
## Environment
* GHC version used: 9.2.5https://gitlab.haskell.org/ghc/ghc/-/issues/22157Late lambda lifting and multiple arguments2022-09-13T13:27:01ZSimon Peyton JonesLate lambda lifting and multiple argumentsConsider
```
module LL where
f xs x a b c d e f = let g :: Bool -> [a] -> ([a], Int, Int, Int, Int, Int, Int)
g True ys = (ys, a, b, c, d, e, f)
g False ys = g x ys
...Consider
```
module LL where
f xs x a b c d e f = let g :: Bool -> [a] -> ([a], Int, Int, Int, Int, Int, Int)
g True ys = (ys, a, b, c, d, e, f)
g False ys = g x ys
in case g x xs of { (xs', _, _, _, _, _, _) ->
case g x xs' of { (xs'', _, _, _, _, _, _) ->
(xs', xs'') } }
```
Currently late-lambda-lifting does not lift out `g`, because the lifted function has a lot of arguments. But actually lifting up to top level will reduce allocation (by not allocating `g`) in exchange for some stack shuffling. The [wiki page](https://gitlab.haskell.org/ghc/ghc/-/wikis/late-lam-lift) and [draft paper](https://github.com/sgraf812/late-lam-lift/blob/master/paper.pdf) just mutter about register pressure.
My instinct is that lifting is a win, no matter how many arguments -- or at least that the threshold should be significantly larger than currently.
So this ticket is just to suggest that some expermimentation with `-fstg-lift-lams-non-rec-args` and `-fstg-lift-lams-rec-args` could be profitable.
------------------------
I tripped over this when doing some performance debugging on another MR (!7847), in `nofib/real/smallpt`. For some unrelated reason, instead of
```
let g = \xy. blah in foldr k z [a,b,a]
```
we were unrolling the loop to
```
case g c z of r1 -> case g b r1 of r2 -> ...
```
In the former case, once `foldr` was inlined we could inline `g` so in the end it was never allocated. But in the unrolled version it was.https://gitlab.haskell.org/ghc/ghc/-/issues/9374Investigate Static Argument Transformation2021-05-21T09:54:21ZJan Stolarekjan.stolarek@ed.ac.ukInvestigate Static Argument TransformationAt the moment the Static Argument Transformation (SAT) optimisation is not run unless explicitly enabled with `-fstatic-argument-transformation`. There was a comment by Max Bolingbroke in DynFlags that said this (that comment is now remo...At the moment the Static Argument Transformation (SAT) optimisation is not run unless explicitly enabled with `-fstatic-argument-transformation`. There was a comment by Max Bolingbroke in DynFlags that said this (that comment is now removed and replaced with a reference to this ticket):
```
Max writes: I think it's probably best not to enable SAT with -O2 for the
6.10 release. The version of SAT in HEAD at the moment doesn't incorporate
several improvements to the heuristics, and I'm concerned that without
those changes SAT will interfere with some attempts to write "high
performance Haskell", as we saw in some posts on Haskell-Cafe earlier
this year. In particular, the version in HEAD lacks the tail call
criterion, so many things that look like reasonable loops will be
turned into functions with extra (unneccesary) thunk creation.
```
1. 10 was a long time ago. Has anything changed since then? Does it make sense to enable that optimisation now? What are the mentioned heuristics and were they finally implemented? Does anyone know what Haskell-cafe posts does Max refer to?