GHC issues (https://gitlab.haskell.org/ghc/ghc/-/issues)

https://gitlab.haskell.org/ghc/ghc/-/issues/24471: Float out can take quadratic time (Jaro Reinders, 2024-03-06)

## Summary
I've written a simple Template Haskell program that produces nested expressions. When these expressions get large, they can take a very long time to compile.
## Steps to reproduce
```haskell
{-# LANGUAGE TemplateHaskell #-}
module Common where
data List_ a f = Nil_ | Cons_ a f deriving Functor
between alg a b
| a == b = [|| $$alg Nil_ ||]
| otherwise = [|| $$alg (Cons_ a $$(between alg (a + 1) b)) ||]
```
```haskell
{-# LANGUAGE TemplateHaskell #-}
module Foo where
import Common
foo :: (List_ Int a -> a) -> a
foo alg = $$(between [|| alg ||] 0 1000)
```
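To make the blow-up concrete, here is a hand-written (illustrative, not actually spliced) expansion of what `$$(between [|| alg ||] 0 3)` generates; each increment of the third argument adds one more layer of nesting for float-out to traverse:

```haskell
-- Hand-expansion of $$(between [|| alg ||] 0 3); the real module uses 1000,
-- giving 1000 nested applications of alg.
data List_ a f = Nil_ | Cons_ a f

expanded :: (List_ Int a -> a) -> a
expanded alg = alg (Cons_ 0 (alg (Cons_ 1 (alg (Cons_ 2 (alg Nil_))))))

-- An example algebra: compute the length of the encoded list.
len :: List_ Int Int -> Int
len Nil_        = 0
len (Cons_ _ n) = n + 1
```

Running `expanded len` yields 3, one per `Cons_` layer.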
If we compile this with `ghc -O Foo.hs -ddump-timings`, we can see the float-out timings. Varying the third argument of `between` (1000 above) shows how it scales:
| size | 2nd float-out time (ms) |
| ------: | ------: |
| 1000 | 601 |
| 2000 | 2595 |
| 3000 | 5862 |
| 4000 | 10944 |
| 5000 | 18825 |
| 6000 | 25289 |
| 7000 | 34933 |
| 8000 | 49785 |
| 9000 | 61680 |
| 10000 | 76162 |
So it seems float-out takes quadratic time in this case.
In particular, I'm looking at the second float out pass, which looks like this in the dump:
```
*** Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = True}) [Foo]:
Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [Foo]: alloc=205141562384 time=76162.434
```
The remaining timings do seem to scale linearly, with code generation taking the most time besides that second float-out pass.
## Expected behavior
Compile faster.
## Environment
* GHC version used: 9.8.1

https://gitlab.haskell.org/ghc/ghc/-/issues/23823: Error calls in INLINE functions (Simon Peyton Jones, 2023-08-15)

Here is a function from `testsuite/test/perf/compiler/SuperRecord.hs`:
```
{-# INLINE rcons #-}
rcons (_ := val) lts =
unsafePerformIO $! IO $ \s# ->
case newSmallArray# newSize# (error "No value") s# of
(# s'#, arr# #) ->
case recCopyInto (Proxy :: Proxy lts) lts (Proxy :: Proxy (Sort (l := t ': lts))) arr# s'# of
s''# ->
case writeSmallArray# arr# setAt# (unsafeCoerce# val) s''# of
s'''# ->
case unsafeFreezeSmallArray# arr# s'''# of
(# s''''#, a# #) -> (# s''''#, Rec a# #)
where
!(I# setAt#) =
fromIntegral (natVal' (proxy# :: Proxy# (RecVecIdxPos l (Sort (l := t ': lts)))))
newSize# = size# +# 1#
!(I# size#) = fromIntegral $ natVal' (proxy# :: Proxy# s)
```
That INLINE inlines a copy of the function at every call site, *including the call to `error "No value"`*, which expands to this Core
```
(error
@LiftedRep
@Any
((GHC.Stack.Types.pushCallStack
(unpackCString# "error"#,
GHC.Stack.Types.SrcLoc
(unpackCString# "main"#)
(unpackCString# "SuperRecord"#)
(unpackCString# "SuperRecord.hs"#)
(GHC.Types.I# 106#)
(GHC.Types.I# 35#)
(GHC.Types.I# 106#)
(GHC.Types.I# 40#))
GHC.Stack.Types.emptyCallStack)
`cast` <Co:4> :: GHC.Stack.Types.CallStack
~R# (?callStack::GHC.Stack.Types.CallStack))
(unpackCString# "No value"#))
```
Since it's inside an INLINE unfolding (which promises to inline what the user wrote), it won't get floated out, so we just repeatedly duplicate this thing.
Silly. I can't see an easy way to avoid it (FloatOut doesn't look inside stable unfoldings), so I'm just recording it for now. Perhaps we should float bottoming expressions (and maybe other constants) out of INLINE unfoldings?

https://gitlab.haskell.org/ghc/ghc/-/issues/23449: FloatOut: Float to top-level only vs. floating iff there are no free variables (Sebastian Graf, 2023-05-30)

This is the config record for FloatOut:
```hs
data FloatOutSwitches = FloatOutSwitches {
floatOutLambdas :: Maybe Int, -- ^ Just n <=> float lambdas to top level, if
-- doing so will abstract over n or fewer
-- value variables
-- Nothing <=> float all lambdas to top level,
-- regardless of how many free variables
-- Just 0 is the vanilla case: float a lambda
-- iff it has no free vars
...
floatToTopLevelOnly :: Bool -- ^ Allow floating to the top level only.
}
```
I wonder what the difference is between `floatOutLambdas = Just 0` and `floatToTopLevelOnly = True`. Perhaps it is relevant when the RHS of a freely floatable binding has unlifted type? When would we ever float that? Puzzling.
We seem to use `floatToTopLevelOnly` exclusively for static-pointer floating.

https://gitlab.haskell.org/ghc/ghc/-/issues/22673: Full laziness breaks common zip pattern (David Feuer, 2023-01-17)

## Summary
The common idiom
```haskell
zip [0..] xs
```
can go wrong even when the numbers are `Int`s, leading to a space leak and poor performance.
## Steps to reproduce
This showed up in `indexed-traversable`. A simple reproduction:
```haskell
itraverseList :: Applicative f => (Int -> a -> f b) -> [a] -> f [b]
itraverseList f = traverse (uncurry' f) . zip [0..]
uncurry' :: (a -> b -> c) -> (a, b) -> c
uncurry' f (a, b) = f a b
{-# INLINE uncurry' #-}
```
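For reference, the function's semantics are simple index/value pairing (the complaint below is purely about the operational behaviour of the compiled code, not its result):

```haskell
-- The definitions from the reproduction, with a small semantic example.
itraverseList :: Applicative f => (Int -> a -> f b) -> [a] -> f [b]
itraverseList f = traverse (uncurry' f) . zip [0..]

uncurry' :: (a -> b -> c) -> (a, b) -> c
uncurry' f (a, b) = f a b
{-# INLINE uncurry' #-}

-- e.g. itraverseList (\i a -> Just (i, a)) "ab" == Just [(0,'a'), (1,'b')]
```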
Compiling with `-O -ddump-simpl` (`-O2` gives something similar) produces
```
Rec {
-- RHS size: {terms: 13, types: 4, coercions: 0, joins: 0/0}
ZipBug.itraverseList_go3 [Occ=LoopBreaker]
:: GHC.Prim.Int# -> [Int]
[GblId, Arity=1, Str=<L>, Unf=OtherCon []]
ZipBug.itraverseList_go3
= \ (x_a1k0 :: GHC.Prim.Int#) ->
GHC.Types.:
@Int
(GHC.Types.I# x_a1k0)
(case x_a1k0 of wild_X1 {
__DEFAULT -> ZipBug.itraverseList_go3 (GHC.Prim.+# wild_X1 1#);
9223372036854775807# -> GHC.Types.[] @Int
})
end Rec }
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
ZipBug.itraverseList1 :: [Int]
[GblId,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 20 0}]
ZipBug.itraverseList1 = ZipBug.itraverseList_go3 0#
-- RHS size: {terms: 35, types: 47, coercions: 0, joins: 0/2}
itraverseList
:: forall (f :: * -> *) a b.
Applicative f =>
(Int -> a -> f b) -> [a] -> f [b]
[GblId,
Arity=2,
Str=<LP(A,MCM(L),A,LCL(C1(C1(L))),A,A)><LCL(C1(L))>,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [60 60] 260 60}]
itraverseList
= \ (@(f_aKs :: * -> *))
(@a_aKt)
(@b_aKu)
($dApplicative_aKv :: Applicative f_aKs)
(f1_auu :: Int -> a_aKt -> f_aKs b_aKu) ->
let {
f2_s1jK :: f_aKs [b_aKu]
[LclId]
f2_s1jK
= pure @f_aKs $dApplicative_aKv @[b_aKu] (GHC.Types.[] @b_aKu) } in
letrec {
go_s1kP [Occ=LoopBreaker, Dmd=LCL(C1(L))]
:: [Int] -> [a_aKt] -> f_aKs [b_aKu]
[LclId, Arity=2, Str=<1L><ML>, Unf=OtherCon []]
go_s1kP
= \ (ds_a1kj :: [Int]) (_ys_a1kk :: [a_aKt]) ->
case ds_a1kj of {
[] -> f2_s1jK;
: ipv_a1kn ipv1_a1ko ->
case _ys_a1kk of {
[] -> f2_s1jK;
: ipv2_a1ks ipv3_a1kt ->
GHC.Base.liftA2
@f_aKs
$dApplicative_aKv
@b_aKu
@[b_aKu]
@[b_aKu]
(ZipBug.itraverseList2 @b_aKu)
(f1_auu ipv_a1kn ipv2_a1ks)
(go_s1kP ipv1_a1ko ipv3_a1kt)
}
}; } in
\ (x_a128 :: [a_aKt]) -> go_s1kP ZipBug.itraverseList1 x_a128
```
Note that `itraverseList1` is the list `[0..]`, and may expand without bound.
## Expected behavior
I expect
1. That `zip` would fuse with the enumeration generator, erasing the list of `Int`s.
2. That if `zip` failed to fuse with the enumeration generator, then GHC would at least recognize that an `Int` enumeration is too cheap to float to the top.
Note that expanding the definition of `itraverseList` seems to magically fix the problem:
```haskell
itraverseList :: Applicative f => (Int -> a -> f b) -> [a] -> f [b]
itraverseList f xs = traverse (uncurry' f) . zip [0..] $ xs
```
gives
```
-- RHS size: {terms: 6, types: 5, coercions: 0, joins: 0/0}
ZipBug.itraverseList1 :: forall {b}. b -> [b] -> [b]
[GblId,
Arity=2,
Str=<L><L>,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=True)}]
ZipBug.itraverseList1
= \ (@b_aKs) (ds_a12W :: b_aKs) (ds1_a12X :: [b_aKs]) ->
GHC.Types.: @b_aKs ds_a12W ds1_a12X
-- RHS size: {terms: 38, types: 41, coercions: 0, joins: 0/2}
itraverseList
:: forall (f :: * -> *) a b.
Applicative f =>
(Int -> a -> f b) -> [a] -> f [b]
[GblId,
Arity=3,
Str=<LP(A,MCM(L),A,LCL(C1(C1(L))),A,A)><LCL(C1(L))><1L>,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [60 60 0] 261 0}]
itraverseList
= \ (@(f_aKq :: * -> *))
(@a_aKr)
(@b_aKs)
($dApplicative_aKt :: Applicative f_aKq)
(f1_aur :: Int -> a_aKr -> f_aKq b_aKs)
(xs_aus :: [a_aKr]) ->
let {
n_s1k2 :: f_aKq [b_aKs]
[LclId]
n_s1k2
= pure @f_aKq $dApplicative_aKt @[b_aKs] (GHC.Types.[] @b_aKs) } in
letrec {
go3_a1kb [Occ=LoopBreaker, Dmd=SCS(C1(L))]
:: GHC.Prim.Int# -> [a_aKr] -> f_aKq [b_aKs]
[LclId, Arity=2, Str=<L><1L>, Unf=OtherCon []]
go3_a1kb
= \ (x_a1kc :: GHC.Prim.Int#) (eta_B0 :: [a_aKr]) ->
case eta_B0 of {
[] -> n_s1k2;
: y_a1jT ys_a1jU ->
GHC.Base.liftA2
@f_aKq
$dApplicative_aKt
@b_aKs
@[b_aKs]
@[b_aKs]
(ZipBug.itraverseList1 @b_aKs)
(f1_aur (GHC.Types.I# x_a1kc) y_a1jT)
(case x_a1kc of wild1_X1 {
__DEFAULT -> go3_a1kb (GHC.Prim.+# wild1_X1 1#) ys_a1jU;
9223372036854775807# -> n_s1k2
})
}; } in
go3_a1kb 0# xs_aus
```
which is exactly what we want.
## Environment
* GHC version used: 9.4.3
Optional:
* Operating System:
* System Architecture:

https://gitlab.haskell.org/ghc/ghc/-/issues/22312: WorkWrap/FloatOut: Float out strictly unboxed free variables (Sebastian Graf, 2022-10-25)

Consider
```hs
f :: (Int,Int) -> Int -> Int
f _ 0 = 0
f x y = g y
where
g y = case x of
(p,q) -> if p+y > 0 then 0
else g (y-1)
```
Note that `f` is not strict in `x`, but `g` is and doesn't even need the box. With `-O` (in the absence of LiberateCase), we fail to float out and unbox `x`:
```
f = \ (ds_d1pt :: (Int, Int)) (ds1_d1pu :: Int) ->
case ds1_d1pu of { GHC.Types.I# ds2_d1px ->
case ds2_d1px of ds3_X2 {
__DEFAULT ->
joinrec {
$wg_s1re [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
:: GHC.Prim.Int# -> Int
[LclId[JoinId(1)], Arity=1, Str=<L> {d1pt->S!P(S!P(L),A)}, Unf=OtherCon []]
$wg_s1re (ww_s1rc :: GHC.Prim.Int#)
= case ds_d1pt of { (p_atY, q_atZ) ->
case p_atY of { GHC.Types.I# x_a1q0 ->
case GHC.Prim.># (GHC.Prim.+# x_a1q0 ww_s1rc) 0# of {
__DEFAULT -> jump $wg_s1re (GHC.Prim.-# ww_s1rc 1#);
1# -> Lib.f1
}
}
}; } in
jump $wg_s1re ds3_X2;
0# -> Lib.f1
}
}
```
Hence we keep on switching over it in `$wg` instead of just closing over the unboxed `x_a1q0`.
We float lazy work *inside* lambdas called at most once; hence we should also float strict work *outside* lambdas known to be called strictly and possibly many times.
I'm not sure what the best place to do this would be and it seems we lack the information of the outermost possible context in which `x` is used strictly.
But given we had that information, or at least knew "`x` is used strictly in the whole `joinrec`", then
1. We could do the transformation in WW, wrapping an unboxing case around the `joinrec` and rebox it in `$wg`'s body, or
2. We could tweak FloatOut so that it floats out `case x of (p,q) ->` if `x` is strict and unboxed.
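In source terms, option 2 amounts to the following hand-rewritten sketch (illustrative only; the real transformation would happen on Core, and the name `f2` is hypothetical):

```haskell
f2 :: (Int, Int) -> Int -> Int
f2 _ 0 = 0
f2 x y0 =
  case x of
    (p, _q) ->        -- scrutinise x once, outside the loop,
      let g y         -- so g closes over the unboxed p
            | p + y > 0 = 0
            | otherwise = g (y - 1)
      in g y0
```

This is semantically equal to `f`: whenever `y /= 0`, the original `g` scrutinises `x` on entry anyway, so hoisting the `case` does not change strictness.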
Inspired by https://gitlab.haskell.org/ghc/ghc/-/issues/22303. In traditional dataflow-analysis terms, strictness analysis is like very-busy-expression analysis and FloatOut is like the code-hoisting transformation (see https://ethz.ch/content/dam/ethz/special-interest/infk/inst-cs/lst-dam/documents/Education/Classes/Fall2015/210_Compiler_Design/Slides/extra.pdf).

https://gitlab.haskell.org/ghc/ghc/-/issues/21785: replicateM space leak with lists (claude, 2023-01-04)

See master ticket #1168.
## Summary
`base:Control.Monad.replicateM` uses too much memory when `m a = [a]` and the arguments are not small.
## Steps to reproduce
```haskell
import Control.Monad (replicateM)
shapes bias p =
[ s
| m <- [0 .. p]
, s <- replicateM (fromInteger m + 2) [1 .. p]
, p == sum (zipWith (*) (map (bias +) s) (tail s))
]
main = mapM_ (print . length . shapes 0) [1..10]
```
uses 500MB after 1min wall-clock time, and eventually gets OOM-killed on a 32GB desktop.
## Expected behavior
Run to completion using small constant space.
## Probable cause
`liftA2 (:) f (loop (cnt - 1))` retains the result of the recursive call in memory so it can be reused for each element of the list `f`; at the top level that result has size `(length f)^(n - 1)`.
## Possible solution
I wrote an alternative implementation that solves this for me in my use case. I'm very happy to donate it to this project, but I don't know:
1. whether it is correct for all `m`, or even for `m = []`
2. whether it is more frugal for all `m`
3. whether it is faster for all `m`.
It is certainly much more complicated than the current implementation, and I don't know fully how/why it works so effectively.
```haskell
{-# LANGUAGE ApplicativeDo #-}
replicateM' :: Applicative m => Int -> m a -> m [a]
replicateM' n ls = go n (pure id)
where
go n fs
| n <= 0 = do
f <- fs
pure (f [])
| otherwise = go (n - 1) gs
where
gs = do
f <- fs
l <- ls
pure (f . (l:))
```
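As a cheap sanity check of point 1 for the list instance (nothing like a proof for all `m`), the proposed implementation can be compared against `base`'s on small inputs:

```haskell
{-# LANGUAGE ApplicativeDo #-}
import Control.Monad (replicateM)

-- replicateM' as proposed above
replicateM' :: Applicative m => Int -> m a -> m [a]
replicateM' n ls = go n (pure id)
  where
    go n fs
      | n <= 0 = do
          f <- fs
          pure (f [])
      | otherwise = go (n - 1) gs
      where
        gs = do
          f <- fs
          l <- ls
          pure (f . (l:))

-- agree with base's replicateM, element order included, for small cases
check :: Bool
check = and [ replicateM' k "ab" == replicateM k "ab" | k <- [0 .. 4] ]
```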
Unfortunately I don't have resources at the moment to spend on researching/verifying/proving points 1, 2, 3 above; I imagine this task would be much easier for a GHC developer familiar with the process of swapping the implementation and running the test suite.
More details including heap profile graphs at <https://mathr.co.uk/blog/2022-06-25_fixing_replicatem_space_leak.html>
## Environment
* GHC version used: 8.8.4
* Operating System: Debian Bullseye
* System Architecture: amd64
```
$ apt-cache policy ghc
ghc:
Installed: 8.8.4-2
Candidate: 8.8.4-2
Version table:
*** 8.8.4-2 990
990 http://ftp.uk.debian.org/debian bullseye/main amd64 Packages
100 /var/lib/dpkg/status
```
Note: I don't think the `replicateM` implementation has changed since base-4.13.0.0.

https://gitlab.haskell.org/ghc/ghc/-/issues/21608: Need a bit more case-of-case in InitialPhase (Simon Peyton Jones, 2022-08-19)

[This comment](https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7997#note_427680) in !7997 describes an infelicity in floating.
In the nofib `simple` benchmark we have
```
polynomial degree ... = ....(\xyz. foldr k z [1..degree])....
```
By the time we get to FloatOut we have (roughly)
```
polynomial degree ... = ....(\xyz. foldr k z (case degree of I# d# -> build (enum 1# d#))...
```
Do we float that `[1..degree]` out of the `\xyz`? Doing so kills fusion; but it shares the `[1..degree]` which in general may be a big win.
General principle: trust the programmer: if fusion can happen, do it, even at the cost of floating.
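For readers less familiar with the mechanism: the fusion at stake is GHC's foldr/build list fusion, which can be modelled without RULES roughly as follows (`build'` and `enumFT` are illustrative stand-ins for `GHC.Base.build` and the enumeration producer; the real rewrite is `foldr k z (build g) ==> g k z`):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A producer abstracted over cons/nil, like build's argument.
build' :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build' g = g (:) []

-- [lo..hi] as such a producer, standing in for the 'build (enum 1# d#)'
-- in the dump above.
enumFT :: Int -> Int -> (Int -> b -> b) -> b -> b
enumFT lo hi c n = go lo
  where
    go i | i > hi    = n
         | otherwise = c i (go (i + 1))

-- Before fusion: materialise the list, then fold it away.
sumUnfused :: Int -> Int
sumUnfused d = foldr (+) 0 (build' (enumFT 1 d))

-- After the foldr/build rewrite: no intermediate list at all.
sumFused :: Int -> Int
sumFused d = enumFT 1 d (+) 0
```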
In HEAD
* Fusion does not happen in `InitialPhase` because the `case` gets in between the `foldr` and the `build`.
* The `case` does not get pulled out of the strict argument because `sm_case_case` is off in `InitialPhase`.
* We don't float because of the bizarre `Case [MFEs]` in SetLevels (c.f. #19001). It's a strict argument position, headed by a `case` so we don't float. I'm trying to get rid of this strange special case.
* So in the subsequent Phase 2 simplification (with `sm_case_case` on) fusion can happen, quite coincidentally.
So in HEAD we get fusion, but only by a fluke; and !7997 removes some of the bizarreness, so the fluke no longer happens.
As a result, `simple` allocates 28% more.
----------------------------
I think the right solution is to make more fusion happen in `InitialPhase`. Maybe we can make `sm_case_case` switch off case-of-case *only for multi-alternative cases*. I *think* those are the ones that are the true target of `sm_case_case=False` (a sadly undocumented design).

https://gitlab.haskell.org/ghc/ghc/-/issues/21392: Don't float out join points to top-level with an interesting demand (Sebastian Graf, 2024-02-01)

Due to a complicated interaction with FloatOut in !7599, I'm seeing roughly the following sequence of steps:
```
==> after DmdAnal
f = ... joinrec go [Dmd=SCS(C1(P(1L,A))),Str=<S><L>] x y =
... case <e> of x' { __DEFAULT -> go x' y' }
in ... go ...
==> FloatOut + FloatIn strict arg `e`
go [Str=<S><L>]
go = ... go <e> y' ... -- NB: Dropped the eval
f = ... go ...
==> DmdAnal, 2nd run
go [Str=<L><L>] -- Urgh, now lazy!
go = ... go <e> y' ...
f = ... go ...
```
This is what happens:
1. The first run of DmdAnal records a signature `<S><L>` for the join point `go`, assuming a demand of `P(1L,A)` on the join body. NB: DmdAnal currently computes better signatures for join points than for regular functions because their evaluation context is known.
2. The fact that `go` is strict in its first arg allows us to drop the eval on `e` in Core and we get `go e y'` in the recursive call because `go`, which is easier to handle in Core2Core passes.
3. In the meantime, `go` has been simplified enough for the second FloatOut pass to see that it can float to the top-level. Off it goes
4. But on the top-level it's just a regular function where DmdAnal isn't smart enough (yet?) to figure out an accurate evaluation context. So our strictness signature gets worse, `<L><L>`. And now we dropped the eval in (2) but can't recover it! We made the program lazier; in CorePrep we'll let-bind instead of case-bind `e`.
In !7599, we made inlined versions of `addListToUniqDSet` lazier in `GHC.Linker.Loader`, to detrimental effect on `MultiLayerModulesTH_OneShot`, which increased by 4.8%.
In a multi-day effort I could come up with a reproducer, but it only reproduces with !7599 (because `t` has demand `MP(1L,A)` which would be `MP(ML,A)` today. I couldn't reproduce with a demand of `1P(1L,A)` because then `t` will just be turned into a `case` and pushed into the join point):
```hs
module Lib (f) where
import Data.List
import Data.Ord
newtype Unique = U { unU :: Int }
class Uniquable u where getKey :: u -> Unique
instance Uniquable Int where getKey = U
data UMap a = UMap { unS :: ![(Unique,a)], unI :: !Int }
insertBy' f v !xs = insertBy f v xs
{-# NOINLINE insertBy' #-}
addOne :: Uniquable u => UMap a -> u -> a -> UMap a
addOne (UMap set n) x v = UMap (insertBy' (comparing (unU . fst)) (getKey x,v) set) (n+1)
newtype USet u = USet (UMap u)
insertOne :: Uniquable u => USet u -> u -> USet u
insertOne (USet s) x = USet (addOne s x x)
insertMany :: Uniquable u => USet u -> [u] -> USet u
insertMany s vs = foldl' insertOne s (reverse (reverse vs))
seq' = seq
{-# NOINLINE seq' #-}
blah s@(USet m) = unS m `seq'` s
{-# OPAQUE blah #-}
end (USet m) = unS m
{-# NOINLINE end #-}
f :: USet Int -> [Int] -> [(Unique,Int)]
f !xs ys
| length ys == 13 = end $ blah t
| length ys == 23 = reverse $ end $ blah t
| otherwise = []
where
t = insertMany xs (reverse $ reverse $ reverse $ reverse ys)
```
Welp.
My suggestion for a fix: In `SetLevels.destLevel`, only float join points (to the top-level) if their demand is not "interesting", e.g. not just `nCn(C1(L))`. Alternatively, zap their demand signatures when floating. Perhaps a combination of both.
Perhaps a strange artifact is that `go`'s first arg is actually unlifted (in terms of `StrictWorkerId`) and perhaps will be eval'd in Stg2Cmm. But the thunk will be allocated nonetheless, which is a waste of resources. Maybe we could case-bind in CorePrep expressions that go in unlifted positions. But that isn't strictly the cause of the issue here.

https://gitlab.haskell.org/ghc/ghc/-/issues/20284: Give up on strict-arg and strict-let stuff in Simplifier (Sebastian Graf, 2021-09-15)

Recently, we saw multiple problems relating to eta-expansion and (case) floating (#19970, #20273) that could be worked around if GHC didn't do the strict-arg transformation
```
f (case e of p -> rhs) ==> case e of p -> f rhs
```
Note that the LHS has richer scoping information than the RHS and it's easier to float stuff from the LHS.
Simon suggested in a call that we should see what breaks if we don't do it. CorePrep will ultimately do the transformation ("call-by-value") anyway, so we can just defer doing so until then.

https://gitlab.haskell.org/ghc/ghc/-/issues/20269: Floating unboxed tuple expressions to top-level (Sebastian Graf, 2022-11-16)

Consider this program:
```hs
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Lib where
blah :: Int -> (# Int, [Int] #)
blah x = (# x, replicate x x #)
{-# NOINLINE blah #-}
foo :: Int -> Int -> Int
foo x =
let go [] = (* x)
go (x:xs) = go xs . (+ x)
z = case blah 42 of (# x, xs #) -> go (x:xs)
in z
```
Ideally, we'd float out the `blah 42` expression to top-level and eta-expand `z`. Indeed that's what we do if `blah` returns a lifted pair, but here `blah` returns an unboxed (and thus unlifted) pair. Thus, GHC hesitates to float unlifted stuff (to top-level at least, #17521) and we get bad code.
This comes up at the moment whenever we inline the wrapper of a CPR'd function and didn't manage to float out the wrapper call before (reasons unclear). One solution is to defer inlining of the wrapper, which already is the case (wrappers are inlined in the final phase). But for any pass that happens *after* final phase Simplification, we will see these function calls with unlifted result type which we are unable to float out (in the FloatOut phase that follows DmdAnal, perhaps). Besides, it's complicated to think about these phase orderings.
Thus I propose the following transformation, at least for things that can float to top-level:
```hs
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Lib where
blah :: Int -> (# Int, [Int] #)
blah x = (# x, replicate x x #)
{-# NOINLINE blah #-}
lvl :: (Int, [Int])
lvl = case blah 42 of (# x, xs #) -> (x, xs)
foo :: Int -> Int -> Int
foo x =
let go [] = (* x)
go (x:xs) = go xs . (+ x)
z = case lvl of (x, xs) -> go (x:xs)
in z
```
and then eta-expand `z`, as expected. Not so sure this is a big win if the thing can't float to top-level, though. Perhaps it still is if it enables further eta-expansion.

https://gitlab.haskell.org/ghc/ghc/-/issues/19970: Case float wins over eta-expansion (again) (Sebastian Graf, 2021-09-30)

After #19001, here is another reproducer that is distilled from `reptile` (edit: I think this reproducer is invalid. Use https://gitlab.haskell.org/ghc/ghc/-/issues/19970#note_372763 instead):
```hs
{-# LANGUAGE UnboxedTuples #-}
import Control.Monad
import Data.List hiding (foldr,length)
import Data.Char
-- It's important that escom gets WW'd for its result to reproduce the issue
-- We'll do so manually here
escom :: [Char] -> [Int] -> [Char]
escom str ns = case wescom str ns of (# x, xs #) -> x:xs
{-# INLINE[2] escom #-}
wescom :: [Char] -> [Int] -> (# Char, [Char] #)
wescom str ns = case '\ESC' : foldr f "" ns of x:xs -> (# x, xs #)
where
f n "" = show n ++ str
f n s = show n ++ "," ++ s
{-# NOINLINE[2] wescom #-}
shapewindow :: [Int] -> [Char]
shapewindow = escom "W" -- x y w h
func :: Int -> [Char]
func mode = escom "b" [mode]
setup :: String
setup = "huge"
{-# NOINLINE setup #-}
potatotile :: [String] -> [Char]
potatotile = escom "W" . map length -- x y w h
{-# NOINLINE potatotile #-}
hash :: String -> Int
hash = foldl' (\acc c -> ord c + acc*31) 0
{-# INLINE hash #-}
salt :: a -> IO a
salt = pure
{-# NOINLINE salt #-}
main = do
replicateM_ 500 $ do
s <- salt "blah"
print $ hash (potatotile (lines s) ++ func 13 ++ func 15)
```
If you look at the simplified Core, you'll see
```hs
-- RHS size: {terms: 17, types: 13, coercions: 0, joins: 0/0}
z_r2jg :: Int -> Int
[GblId]
z_r2jg
= case $wwescom_r2j5 lvl2_r2j7 lvl6_r2jf of
{ (# ww_X1G, ww1_X1H #) ->
\ (w_s2hK :: Int) ->
case w_s2hK of { GHC.Types.I# ww2_s2hM ->
case $s$wgo1_r2jc ww_X1G ww1_X1H ww2_s2hM of ww3_s2ib
{ __DEFAULT ->
GHC.Types.I# ww3_s2ib
}
}
}
Rec {
-- RHS size: {terms: 14, types: 6, coercions: 0, joins: 0/0}
$sgo1_r2jh :: GHC.Prim.Char# -> [Char] -> Int -> Int
[GblId, Arity=3, Str=<L><1L><1P(L)>, Unf=OtherCon []]
$sgo1_r2jh
= \ (sc_s2iK :: GHC.Prim.Char#)
(sc1_s2iL :: [Char])
(eta_B0 [OS=OneShot] :: Int) ->
case eta_B0 of { GHC.Types.I# ipv_s1tt ->
$sgo2_r2ji
(GHC.Prim.+# (GHC.Prim.ord# sc_s2iK) (GHC.Prim.*# ipv_s1tt 31#))
sc1_s2iL
}
-- RHS size: {terms: 21, types: 10, coercions: 0, joins: 0/0}
$sgo2_r2ji :: GHC.Prim.Int# -> [Char] -> Int
[GblId, Arity=2, Str=<L><1L>, Unf=OtherCon []]
$sgo2_r2ji
= \ (sc_s2iO :: GHC.Prim.Int#) (sc1_s2iN :: [Char]) ->
case sc1_s2iN of {
[] -> z_r2jg (GHC.Types.I# sc_s2iO);
: y_a1po ys_a1pp ->
go1_r2jj
ys_a1pp
(case y_a1po of { GHC.Types.C# c#_a1ao ->
GHC.Types.I#
(GHC.Prim.+# (GHC.Prim.ord# c#_a1ao) (GHC.Prim.*# sc_s2iO 31#))
})
}
-- RHS size: {terms: 23, types: 12, coercions: 0, joins: 0/0}
go1_r2jj :: [Char] -> Int -> Int
[GblId, Arity=2, Str=<1L><LP(L)>, Unf=OtherCon []]
go1_r2jj
= \ (ds_a1pl :: [Char]) (eta_B0 [OS=OneShot] :: Int) ->
case ds_a1pl of {
[] -> z_r2jg eta_B0;
: y_a1po ys_a1pp ->
case eta_B0 of { GHC.Types.I# ipv_s1tt ->
go1_r2jj
ys_a1pp
(case y_a1po of { GHC.Types.C# c#_a1ao ->
GHC.Types.I#
(GHC.Prim.+# (GHC.Prim.ord# c#_a1ao) (GHC.Prim.*# ipv_s1tt 31#))
})
}
}
```
Note that the rec group around `$sgo1` is actually strict in the `Int`! But because `z_r2jg` doesn't get a useful strictness signature for arity 1, we don't see it. That `z` has arity 0 is entirely due to an early case float of the `$wwescom` expression, which is not floated out further itself (so that we get a cheap `case lvl of ...`). I lack a bit of context to decide where and how to fix this, but my bet would be to float out the `$wescom` app to top-level.
This regression costs us about 20% of allocs in `reptile`, as I found out in #5075 and then much later in !5667.
I know that, because if you don't worker/wrapper `escom`, e.g.
```hs
escom :: [Char] -> [Int] -> (# Char, [Char] #)
escom str ns = '\ESC' : foldr f "" ns
where
f n "" = show n ++ str
f n s = show n ++ "," ++ s
{-# NOINLINE escom #-}
```
Then you get good code (after !5667, which won't WW `escom` anymore), e.g. all `go` functions unbox the `Int#` because of better strictness info as a result of more eta expansion (see -dverbose-core2core).
It's related to #18793 and #16570.

https://gitlab.haskell.org/ghc/ghc/-/issues/19555: INLINE pragma prevents join point from being floated out (Tarmean, 2021-03-22)

## Summary
The [Data.Text.Internal.Search.indices](https://github.com/haskell/text/blob/d5118aa5a8301cfff4ff55b2e7325900c34ebdeb/src/Data/Text/Internal/Search.hs) function can become significantly slower in some cases when compiled with -O2. Reproducing this depends on the calling context, the INLINE pragma, and strictness, making the underlying issue fairly hard to pin down. I have tried to create a minimal test but cannot claim to understand why this happens.
[This comment](https://github.com/haskell/text/pull/219#issuecomment-800693669) mentions some (probably) relevant issues.
## Steps to reproduce
Compile the following code with -O1 or -O2:
```haskell
module Foo (wrapper) where

-- INLINE or variants like INLINE[1], but not INLINABLE
{-# INLINE loop #-}
-- inlined function contains a nested joinrec loop
loop :: Int -> Int
loop x0 = go x0 0
  where
    go x acc
      | x >= 1000000 = acc
      | otherwise = go (x+1) (acc+step)
      where
        -- joinrec loop has a nested joinrec loop that could be floated out
        step = buildTable 0 x0
        -- the buildTable type signature must be missing
        -- buildTable :: Int -> Int -> Int
        buildTable a i
          | a == i = i
          | otherwise = buildTable (a+1) i

-- Nontrivial call site
wrapper :: Int -> Int
wrapper i
  | i <= 10 = i
  | otherwise = loop i
```
## Expected behavior
`buildTable` should be floated out of the `go` loop. This happens if `buildTable` has an explicit type signature, if `wrapper` is simpler, or if the INLINE pragma is removed:
```
Rec {
-- RHS size: {terms: 14, types: 3, coercions: 0, joins: 0/0}
$wbuildTable
  = \ ww ww1 ->
      case ==# ww ww1 of {
        __DEFAULT -> $wbuildTable (+# ww 1#) ww1;
        1# -> ww1
      }
end Rec }
-- RHS size: {terms: 44, types: 11, coercions: 0, joins: 1/1}
wrapper
  = \ x0 ->
      case x0 of ww { I# ww ->
      case >=# ww 1000000# of {
        __DEFAULT ->
          case $wbuildTable 0# ww of ww { __DEFAULT ->
          joinrec {
            $wgo ww ww
              = case >=# ww 1000000# of {
                  __DEFAULT -> jump $wgo (+# ww 1#) (*# (+# ww ww) 3#);
                  1# -> I# ww
                }; } in
          jump $wgo (+# ww 1#) (*# ww 3#)
          };
        1# -> I# 0#
      }
      }
```
## Actual behavior
The loop remains nested:
```
$wwrapper
  = \ ww ->
      case <=# ww 10# of {
        __DEFAULT ->
          joinrec {
            $wgo ww ww
              = case >=# ww 1000000# of {
                  __DEFAULT ->
                    joinrec {
                      $wbuildTable ww ww
                        = case ==# ww ww of {
                            __DEFAULT -> jump $wbuildTable (+# ww 1#) ww;
                            1# -> jump $wgo (+# ww 1#) (*# (+# ww ww) 3#)
                          }; } in
                    jump $wbuildTable 0# ww;
                  1# -> ww
                }; } in
          jump $wgo ww 0#;
        1# -> ww
      }
```
## Environment
* GHC version used: 8.10.4
Optional:
* Operating System: Windows
* System Architecture: x86-64

https://gitlab.haskell.org/ghc/ghc/-/issues/19223: Motivation for let/app invariant (Sebastian Graf, 2021-10-25)

## Motivation
From time to time I wonder why it's ok to allow unlifted let bindings and then have to re-read the let/app invariant.
And after that, I'm still unconvinced that there is a good reason *why* we have unlifted let bindings and not just `case` expressions.
I think one of the reasons is that we then have an easier time in FloatOut/SetLevels, but SetLevels already seems to be aware of speculation for `Case`; see `Note [Floating single-alternative cases]`.
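The operational difference, sketched in source terms: a `let` is lazy, while Core's `case` always evaluates its scrutinee before entering the alternative. (Note that in *source* Haskell a variable pattern in `case` binds lazily, so the strict case-bind is mimicked below with a bang pattern.) This strictness is exactly why only ok-for-speculation RHSs may be case-bound for unlifted types.

```haskell
{-# LANGUAGE BangPatterns #-}

-- A lazy let never forces its RHS, so a diverging RHS is harmless
-- as long as the binder's value is never demanded:
lazyLet :: Int
lazyLet = let x = error "boom" :: Int in 42

-- Core's strict case-bind, mimicked in source with a bang pattern;
-- the scrutinee is evaluated before the alternative runs:
caseBind :: Int -> Int
caseBind n = case n * n of !sq -> sq + 1
```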
## Proposal
Get rid of the let/app invariant and case-bind all unlifted expressions in Core. Let bindings must always be lifted. Deal with the fall-out by calling `exprOkForSpeculation` on scrutinees to see whether they can be treated the way unlifted let bindings are treated at the moment.

https://gitlab.haskell.org/ghc/ghc/-/issues/19001
FloatOut discards strict evaluation context
2022-09-27T15:33:47Z, Sebastian Graf

Consider
```hs
module Lib (h) where
g :: Int -> (Int,Int)
g n = (n, n+1)
{-# NOINLINE g #-}
h :: Int -> Int
h _ = snd (g 2)
```
we get the simplified Core
```
$wg_sHs
= \ w_sHo ->
(# w_sHo, case w_sHo of { I# x_aGX -> I# (+# x_aGX 1#) } #)
lvl_sHJ = I# 2#
lvl_sH8
= case $wg_sHs lvl_sHJ of { (# ww_sHt, ww_sHu #) ->
(ww_sHt, ww_sHu)
}
h = \ _ -> case lvl_sH8 of { (ds1_aGM, y_aGN) -> y_aGN }
```
Note how `g 2` was floated out and thus separated from its strict `snd` evaluation context in `h`.
In this case that inhibits us from seeing (with #18894) that every call to `g` also evaluates the second component of the returned pair (and never looks into the first).
I'd much rather see the whole expression `snd (g 2)` floated out, so including the surrounding `case`:
```
$wg_sHs
= \ w_sHo ->
(# w_sHo, case w_sHo of { I# x_aGX -> I# (+# x_aGX 1#) } #)
lvl_sHJ = I# 2#
lvl_sH8
= case $wg_sHs lvl_sHJ of { (# ww_sHt, ww_sHu #) ->
ww_sHu
}
h = \ _ -> lvl_sH8
```
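In source terms, the two float-out choices look like this (a hand-written sketch with hypothetical names `hToday`/`hWanted`): floating only `g 2` shares the whole pair, while floating `snd (g 2)` shares just the component that is actually demanded.

```haskell
g :: Int -> (Int, Int)
g n = (n, n + 1)
{-# NOINLINE g #-}

-- What FloatOut produces today: the application is floated on its own,
-- so the shared thunk is the whole pair, detached from `snd`.
hToday :: Int -> Int
hToday = let pair = g 2 in \_ -> snd pair

-- What the ticket asks for: float the maximal expression, including the
-- strict `snd` context, so only the demanded component is shared.
hWanted :: Int -> Int
hWanted = let second = snd (g 2) in \_ -> second
```

Both versions compute the same value; the difference is what the floated binding retains and what later analyses can see.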
------------------
This is partly caused by `Note [Case MFEs]` in SetLevels. See also https://gitlab.haskell.org/ghc/ghc/-/issues/14564#note_145965

https://gitlab.haskell.org/ghc/ghc/-/issues/18231
FloatOut should only eta-expand a dead-ending RHS when arity will increase
2022-12-14T10:28:46Z, Sebastian Graf

Eta-expansion can turn trivial RHSs into non-trivial RHSs. That leads to more top-level bindings after the next Simplifier run, meaning more churn. We should therefore only eta-expand (at least trivial RHSs) if we actually need to (OTOH that's only the case for CorePrep).
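In source terms, the distinction between a trivial RHS and its eta-expansion looks like this (a sketch; the two agree on every defined argument):

```haskell
-- Trivial RHS: just another name for an existing function.
aTrivial :: Int -> Int
aTrivial = succ

-- Eta-expanded RHS: semantically the same on defined arguments, but no
-- longer a trivial binding, so the next Simplifier run gets an extra
-- top-level binding to churn through.
aEta :: Int -> Int
aEta = \eta -> succ eta
```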
FloatOut is currently a bit careless in that regard. Consider
```hs
module Lib where
import Control.Monad (forever)
import Control.Monad.Trans.State.Strict
inc :: State Int ()
inc = modify' (+1)
m :: State Int ()
m = forever inc
```
After the second FloatOut (after demand analysis), we have
```
...
m :: State Int ()
m = ...
Rec {
-- RHS size: {terms: 6, types: 1, coercions: 0, joins: 0/0}
lvl_sOv
:: GHC.Prim.Int# -> Data.Functor.Identity.Identity ((), Int)
[LclId, Arity=1, Str=<L,U>b, Cpr=b]
lvl_sOv
= \ (x_aNg :: GHC.Prim.Int#) ->
a'_sO0 (GHC.Types.I# (GHC.Prim.+# x_aNg 1#))
-- RHS size: {terms: 6, types: 3, coercions: 0, joins: 0/0}
a'_sO0 [Occ=LoopBreaker]
:: Int -> Data.Functor.Identity.Identity ((), Int)
[LclId, Arity=1, Str=<L,U>b, Cpr=b]
a'_sO0
= \ (s1_aNP :: Int) ->
case s1_aNP of { GHC.Types.I# x_aNg [Dmd=<B,A>] -> lvl_sOv x_aNg }
end Rec }
-- RHS size: {terms: 3, types: 1, coercions: 5, joins: 0/0}
a :: State Int ()
[LclIdX,
Arity=1,
Str=<B,1*H>b,
Cpr=b,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)
Tmpl= a'_sO0
`cast` (Sym (Control.Monad.Trans.State.Strict.N:StateT[0]
<Int>_N <Data.Functor.Identity.Identity>_R <()>_N)
:: (Int -> Data.Functor.Identity.Identity ((), Int))
~R# StateT Int Data.Functor.Identity.Identity ())}]
a = (\ (eta_B1 :: Int) -> a'_sO0 eta_B1)
`cast` (Sym (Control.Monad.Trans.State.Strict.N:StateT[0]
<Int>_N <Data.Functor.Identity.Identity>_R <()>_N)
:: (Int -> Data.Functor.Identity.Identity ((), Int))
~R# StateT Int Data.Functor.Identity.Identity ())
```
Both `lvl` and the eta-expanded RHS of `a` (which will lead to yet another top-level binding after the next simplifier run) are unnecessary.
Also we end up allocating an `I#` box in each loop iteration (in `lvl`). But that is probably due to the Simplifier refraining from inlining dead-ending functions. Or SetLevels. I don't know. I think we should inline that, but don't know on what grounds, and whether we would punish code using `error` by doing so. I have to understand `Note [Bottoming floats]`. Or maybe it's because we don't do strictness WW for dead-ending functions. Anyway, the allocation isn't too concerning, but the extra binding is annoying.

https://gitlab.haskell.org/ghc/ghc/-/issues/18056
Float out prevents RULES from firing.
2020-04-17T22:26:23Z, Andreas Klebinger

## Motivation
Consider this function and rule:
```haskell
foo :: Int -> Bool
foo x = L.elem x [1,2::Int]
{-# RULES
"elem/build" [2] forall x (g :: forall b . (Int -> b -> b) -> b -> b)
. L.elem x (build g) = g (\ y r -> (x == y) || r) False
#-}
```
If compiled with -O -fno-full-laziness this fires as expected and produces:
```
foo
= \ (x_ay4 :: Int) ->
case x_ay4 of { I# x1_a1rE ->
case x1_a1rE of {
__DEFAULT -> GHC.Types.False;
1# -> GHC.Types.True;
2# -> GHC.Types.True
}
}
```
which is what we want.
If we compile with -O, then after FloatOut (during the InitialPhase) we get:
```
lvl_s1vK :: forall b. (Int -> b -> b) -> b -> b
lvl_s1vK
= \ f xs ->
f 1 (f 2 xs)
list :: [Int]
list = build lvl_s1vK
foo :: Int -> Bool
foo = \ x -> L.elem x list
```
However we never match on `L.elem x list`, even though the rule would match if the referenced bindings were inlined.
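For reference, here is the right-hand side of the rule applied by hand to the two-element list literal (a sketch of what the fused result should compute; `g` plays the role of the `build` argument for `[1,2]`):

```haskell
-- The rule's RHS, instantiated by hand for the list [1,2]:
fooFused :: Int -> Bool
fooFused x = g (\y r -> x == y || r) False
  where
    -- g is the function `build` would be applied to for [1,2]
    g :: (Int -> b -> b) -> b -> b
    g c n = c 1 (c 2 n)
```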
## Proposal
I'm not sure how this would be best addressed.

https://gitlab.haskell.org/ghc/ghc/-/issues/16039
'GHC.Magic.noinline <var>' should not float out
2020-01-23T19:38:55Z, Gabor Greif

While working on #15155, I discovered that the magic function `GHC.Magic.noinline` tricked the float-out transformation into performing undesirable operations:
Looking at the Core produced:
```hs
{ ghc-prim-0.5.3:GHC.Types.I# x_azMX [Dmd=<S,U>] ->
case x_azMX of {
__DEFAULT -> jump $j_sMly;
3674937295934324920# ->
src<compiler/typecheck/TcTyClsDecls.hs:3561:9-35>
check_ty_roles_sLb3
(src<compiler/typecheck/TcTyClsDecls.hs:3561:24-26> env_aqfK)
(src<compiler/typecheck/TcTyClsDecls.hs:3561:28-31> role_aqfL)
(ghc-prim-0.5.3:GHC.Magic.noinline @ Kind liftedTypeKind)
}
}
```
After simplification we end up with a toplevel variable basically redirecting to an imported variable.
```hs
-- RHS size: {terms: 2, types: 1, coercions: 0, joins: 0/0}
lvl330_r12qn :: Kind
[GblId]
lvl330_r12qn
= ghc-prim-0.5.3:GHC.Magic.noinline @ Kind liftedTypeKind
```
This would cause `IND_STATIC`s when eventually emitted as assembly.
IIRC `GHC.Magic.noinline` was introduced with GHC v8.2. This probably regressed the float-out optimisation.
I have a patch to reclassify `noinline <some-var>` as cheap, so that it won't be floated out.
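For context, `noinline` is semantically the identity function; it only blocks inlining at the use site, which is why classifying `noinline <var>` as cheap is sound. A minimal sketch (assuming GHC ≥ 8.2, where `GHC.Magic.noinline` exists):

```haskell
import GHC.Magic (noinline)

-- `noinline f` behaves exactly like `f`; it merely stops the
-- simplifier from inlining f at this occurrence.
answer :: Int
answer = noinline negate (-42)
```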
N.B.: for more details, see: https://github.com/ghc/ghc/pull/224
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/15606
Don't float out lets in between lambdas
2019-07-07T18:03:46Z, Simon Peyton Jones

Consider
```
f = \x. let y = <blah>
in \z. let v = h x y in <stuff>
```
The full laziness pass will float out that v-binding thus
```
f = \x. let y = <blah>
v = h x y
in \z. <stuff>
```
And now (if `h` is, say, imported) it'll stay like that.
But suppose `<blah>` simplifies to `Just x`. Then we allow ourselves to eta-expand thus
```
f = \x z. let y = <blah>
v = h x y
in <stuff>
```
Now (an early design choice in the let-floater) we never float the v-binding in between the `\x` and `\z`.
This is very non-confluent: a small change in exactly how rapidly `<blah>` simplifies can
have a big, irreversible effect on the code for `f`.
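The reason the placement matters is sharing under partial application (a sketch): in the first form, `y` is computed once per `fShared x` and reused for every `z`; in the second, its work is redone on every full call. Floating a binding between the two lambdas silently picks one of these behaviours.

```haskell
-- y is allocated once per partial application `fShared x`
fShared :: Int -> Int -> Int
fShared x = let y = x * x in \z -> y + z

-- y's work is redone on every full call
fNoShare :: Int -> Int -> Int
fNoShare x z = x * x + z
```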
IDEA: extend the let-floater's design choice to not float out between two lambdas, even if
they are separated by lets/cases etc. One way to say this is to ask when a lambda gets
a new level number compared to its immediately enclosing lambda.
Examples where `y` gets the same level number as `x`
- `\x.\y. blah`
- `\x. let binds in \y`
- `\x. case scrut of pi -> \y.blah`
Examples where `y` gets a level number one bigger than `x`
- `\x. let v = \y.rhs in blah`
- `\x. f (\y.rhs)`
This probably won't make a lot of difference, but it'd be worth trying.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.4.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/8457
-ffull-laziness does more harm than good
2020-11-17T17:20:15Z, errge

In this bug report I'd like to argue that `-ffull-laziness` shouldn't
be turned on automatically with either `-O` or `-O2`, because it's
dangerous and can cause serious memory leaks which are hard to debug
or prevent. I'll also try to show that its optimization benefits are
negligible. Actually, my benchmarks show that it's beneficial to turn
it off even in the cases where we don't hit a space leak.
We met this issue last week, but it had been reported several times
before: e.g. #917 and #5262.
A typical example is the following:
```haskell
main :: IO ()
main = task () >> task ()
task :: () -> IO ()
task () = printvalues [1..1000000 :: Int]
printvalues :: [Int] -> IO ()
printvalues (x:xs) = print x >> printvalues xs
printvalues [] = return ()
```
We succeed with `-O0`, but fail with `-O`:
```
errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O0 -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1620098744 bytes, 3117 GCs, 32265/42580 avg/max bytes residency (3 samples), 2M in use, 0.00 INIT (0.00 elapsed), 1.28 MUT (1.28 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1444098612 bytes, 2761 GCs, 3812497/13044272 avg/max bytes residency (7 samples), 28M in use, 0.00 INIT (0.00 elapsed), 1.02 MUT (1.03 elapsed), 0.12 GC (0.12 elapsed) :ghc>>
```
28M? What the leak!? Well, it's `-ffull-laziness`:
```
errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O -fno-full-laziness -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1484098612 bytes, 2835 GCs, 34812/42580 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.04 MUT (1.04 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
```
We get constant space and the fastest run-time too, since we spare
some cycles on GC.
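What full laziness does here, written out by hand (a sketch with hypothetical names): the constant list in `task` is floated past the `()` lambda to the top level, where, as a CAF, it is retained between the two calls instead of being rebuilt lazily and discarded.

```haskell
-- Hand-floated sketch of what -ffull-laziness effectively produces.
lvl :: [Int]
lvl = [1 .. 1000000]       -- now a top-level CAF, kept alive between calls

taskFloated :: () -> Int
taskFloated () = sum lvl   -- pure stand-in for the printing loop
```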
Note that in this instance we are trying to explicitly disable sharing
by using `()` as a fake argument for the function. Also note that
this function may easily be a utility function in a larger code base
or in a library, so it's impractical to say that you shouldn't
use it twice "too close together".
Quoting from the GHC user guide:
```
-O2:
Means: “Apply every non-dangerous optimisation, even if it means
significantly longer compile times.”
The avoided “dangerous” optimisations are those that can make
runtime or space worse if you're unlucky. They are normally turned
on or off individually.
At the moment, -O2 is unlikely to produce better code than -O.
```
This seems to be false at the moment.
We decided to make a broader investigation into this issue and wanted
to know if we can disable this optimization without too much pain.
We came up with this benchmark plan:

- let's benchmark GHC,
- compile all stages with -O, but hack the stage1 compiler to emit `-t` statistics for every file compiled,
- gather these statistics while compiling the libraries and the stage2 compiler.
On the second run we compile the stage1 compiler with
`-O -fno-full-laziness`, but leave everything else unchanged in the
environment.
When we have both results of the compilation of ~1600 files, we match
them up and compute the (logarithmic) ratio of CPU and memory
difference between compilations, the final results for our benchmark.
The results and the raw data can be found at
https://github.com/errge/notlazy.
The overall compilation time dropped from 26:20 to 25:12, which is a
4% improvement. Investigating the full matching shows that this
overall result is from small improvements all around the place.
The results plotted:
- https://github.com/errge/notlazy/blob/master/cpu.png
- https://github.com/errge/notlazy/blob/master/mem.png
The graphs show the logarithmic (100*log_10(new/orig)) ratio of change
in cpu and memory consumption. Therefore negative results mean that
the new compilation method is faster.
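The metric, spelled out (sketch): 0 means unchanged, and -100 means the `-fno-full-laziness` build used a tenth of the CPU or memory of the original.

```haskell
-- 100 * log10(new / orig), the ratio plotted in the graphs
logRatio :: Double -> Double -> Double
logRatio new orig = 100 * logBase 10 (new / orig)
```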
As can be seen on the CPU graph, in most of the cases the difference
is negligible (actually smaller than what can be measured on small
files, which is why we have the spike at 0). Overall we see a small
improvement in CPU, and there are some outliers in both directions,
but there are more drastic improvement cases than drastic regressions.
On the memory graph the situation is much closer to zero. There
is one big positive memory outlier: `DsListComp.lhs`. It uses 69M
originally and now uses 103M. But compiles in 2 seconds both ways and
there are files in the source tree which requires 400M to compile, so
this is not an issue.
After all this, I'd like to hear other opinions about just disabling
this optimization in `-O` and `-O2` and leaving it as an option that
can be turned on when needed, my reasons once more:
- it's unsafe,
- it's hard to debug when you hit its issues,
- the optimization doesn't seem to be very productive,
- it's always easy to force sharing, but it's not easy to force
> copying.
Apparently a Haskell programmer should be lazy, but never fully lazy.
Research done by Gergely Risko \<errge\> and Mihaly Barasz \<klao\>,
confirmed on two different machines with no other running processes.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.7 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | high |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/7367
float-out causes extra allocation
2019-07-07T18:50:04Z, wurmli

The Haskell fannkuchredux (included in nofib as fannkuch-redux) contribution of Louis Wasserman to "The Computer Language Benchmarks Game" at shootout.alioth.debian.org times out on the amd64 machines, but not on the i386. I can reproduce it on my Debian amd64 machine.
It turns out that compiling without optimisation or with a simple -O produces a fast program, but with enormously large heap space allocated (10G compared with 67k on a virtual i386 machine) and also more garbage collector activity.
The source is below (because I don't find a way to attach the file). At the end of the source I copied my make command line, run command line and output produced with option -sstderr.
---------------------
```
{- The Computer Language Benchmarks Game
http://shootout.alioth.debian.org/
contributed by Louis Wasserman
This should be compiled with:
-threaded -O2 -fexcess-precision -fasm
and run with:
+RTS -N<number of cores> -RTS <input>
-}
import Control.Concurrent
import Control.Monad
import System.Environment
import Foreign hiding (rotate)
import Data.Monoid
type Perm = Ptr Word8
data F = F {-# UNPACK #-} !Int {-# UNPACK #-} !Int
instance Monoid F where
mempty = F 0 0
F s1 m1 `mappend` F s2 m2 = F (s1 + s2) (max m1 m2)
incPtr = (`advancePtr` 1)
decPtr = (`advancePtr` (-1))
flop :: Int -> Perm -> IO ()
flop k xs = flopp xs (xs `advancePtr` k)
where flopp i j = when (i < j) $ swap i j >> flopp (incPtr i) (decPtr j)
swap i j = do
a <- peek i
b <- peek j
poke j a
poke i b
flopS :: Perm -> (Int -> IO a) -> IO a
flopS !xs f = do
let go !acc = do
k <- peekElemOff xs 0
if k == 0 then f acc else flop (fromIntegral k) xs >> go (acc+1)
go 0
increment :: Ptr Word8 -> Ptr Word8 -> IO ()
increment !p !ct = do
first <- peekElemOff p 1
pokeElemOff p 1 =<< peekElemOff p 0
pokeElemOff p 0 first
let go !i !first = do
ci <- peekElemOff ct i
if fromIntegral ci < i then pokeElemOff ct i (ci+1) else do
pokeElemOff ct i 0
let !i' = i + 1
moveArray p (incPtr p) i'
pokeElemOff p i' first
go i' =<< peekElemOff p 0
go 1 first
genPermutations :: Int -> Int -> Int -> Ptr Word8 -> Ptr Word8 -> IO F
genPermutations !n !l !r !perm !count = allocaArray n $ \ destF -> do
let upd j !f run = do
p0 <- peekElemOff perm 0
if p0 == 0 then increment perm count >> run f else do
copyArray destF perm n
increment perm count
flopS destF $ \ flops ->
run (f `mappend` F (checksum j flops) flops)
let go j !f = if j >= r then return f else upd j f (go (j+1))
go l mempty
where checksum i f = if i .&. 1 == 0 then f else -f
facts :: [Int]
facts = scanl (*) 1 [1..12]
unrank :: Int -> Int -> (Ptr Word8 -> Ptr Word8 -> IO a) -> IO a
unrank !idx !n f = allocaArray n $ \ p -> allocaArray n $ \ count ->
allocaArray n $ \ pp -> do
mapM_ (\ i -> pokeElemOff p i (fromIntegral i)) [0..n-1]
let go i !idx = when (i >= 0) $ do
let fi = facts !! i
let (q, r) = idx `quotRem` fi
pokeElemOff count i (fromIntegral q)
copyArray pp p (i+1)
let go' j = when (j <= i) $ do
let jq = j + q
pokeElemOff p j =<< peekElemOff pp (if jq <= i then jq else jq - i - 1)
go' (j+1)
go' 0
go (i-1) r
go (n-1) idx
f p count
main = do
n <- fmap (read.head) getArgs
let fact = product [1..n]
let bk = fact `quot` 4
vars <- forM [0,bk..fact-1] $ \ ix -> do
var <- newEmptyMVar
forkIO (unrank ix n $ \ p -> genPermutations n ix (min fact (ix + bk)) p >=> putMVar var)
return var
F chksm mflops <- liftM mconcat (mapM takeMVar vars)
putStrLn $ (show chksm) ++ "\nPfannkuchen(" ++ (show n) ++ ") = " ++ (show $ mflops)
{-
wurmli@noah-nofen:~/hpw/haskell/work/fannkuch$ ghc --make
-XBangPatterns -O -threaded -fllvm -rtsopts fannkuchredux.ghc-3.hs
[1 of 1] Compiling Main ( fannkuchredux.ghc-3.hs, fannkuchredux.ghc-3.o )
Linking fannkuchredux.ghc-3 ...
wurmli@noah-nofen:~/hpw/haskell/work/fannkuch$ ./fannkuchredux.ghc-3 +RTS -N4 -sstderr -RTS 12
3968050
Pfannkuchen(12) = 65
10,538,122,952 bytes allocated in the heap
359,512 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
51,120 bytes maximum slop
4 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6053 colls, 6053 par 0.16s 0.04s 0.0000s 0.0001s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 40.82% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 44.73s ( 11.51s elapsed)
GC time 0.16s ( 0.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 44.89s ( 11.55s elapsed)
Alloc rate 235,589,887 bytes per MUT second
Productivity 99.6% of total user, 387.3% of total elapsed
gc_alloc_block_sync: 31024
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-}
```