The Simplifier's deferring of evals can cause space leaks
See also: #24251 (closed)
## Problem
While trying to prove a point in #21497 (comment 436914), I came up with the following program and decided to continue the discussion in a separate thread:
```haskell
{-# LANGUAGE BangPatterns #-}

import System.Environment

f :: [Int] -> Int
f xs = g (length xs) (even $ mySum xs)
{-# NOINLINE f #-}

g :: Int -> Bool -> Int
g 0 _  = 0
g n !b = length xs + mySum xs + if b then 0 else 1
  where
    xs = [0..n]
{-# NOINLINE g #-}

mySum :: [Int] -> Int
mySum = go 0
  where
    go acc (x:xs) = go (x+acc) xs
    go acc _      = acc

main = do
  (n:_) <- map read <$> getArgs
  print $ f [0..n]
```
The key here is

- `length` forces the spine of `xs`, potentially a very big list
- The thunk `sum xs` retains that list
- This pattern occurs twice in the program, once in `f` and once in `g`.
- `g` has a bang on `b`, so the programmer really wants to force the `even $ mySum xs` thunk before doing the next long and allocation-heavy `length` and then `mySum` operation, so that at most one big list is live at any point
But the Simplifier (on master as well as 9.0) discards the `seq` because the `if` will eval it later on, so we'll get overlapping lifetimes for both big lists (keep an eye out for `ds_s2TY`):
```
-- RHS size: {terms: 49, types: 14, coercions: 0, joins: 0/1}
$wg_r2UH :: GHC.Prim.Int# -> Bool -> GHC.Prim.Int#
[GblId, Arity=2, Str=<1L><ML>, Unf=OtherCon []]
$wg_r2UH
  = \ (ww_s2TW :: GHC.Prim.Int#) (ds_s2TY :: Bool) ->
      case ww_s2TW of ds1_X1 {
        __DEFAULT ->
          case GHC.Prim.># 0# ds1_X1 of {
            __DEFAULT ->
              letrec {
                go3_a2T6 [Occ=LoopBreaker, Dmd=SCS(L)] :: GHC.Prim.Int# -> [Int]
                [LclId, Arity=1, Str=<L>, Unf=OtherCon []]
                go3_a2T6 = ... } in
              case $wgo_r2UG 0# (go3_a2T6 0#) of ww1_s2Ua { __DEFAULT ->
              case ds_s2TY of {
                False -> GHC.Prim.+# ww1_s2Ua 1#;
                True  -> ww1_s2Ua
              }
              };
            1# ->
              case ds_s2TY of {
                False -> 1#;
                True  -> 0#
              }
          };
        0# -> 0#
      }
```
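In source-level terms, the net effect is roughly the following (a hand-written sketch, not compiler output; `gWithSeq`/`gNoSeq` are hypothetical names and `mySum` is the function from the program above):

```haskell
-- What the bang on b asks for: evaluate b (the `even $ mySum xs` thunk that f
-- passed, which retains f's big list) *before* this clause builds its own big
-- list, so that at most one big list is live at any point.
gWithSeq :: Int -> Bool -> Int
gWithSeq n b = b `seq` (length xs + mySum xs + if b then 0 else 1)
  where
    xs = [0..n]

-- What the Simplifier effectively leaves behind: the seq is dropped because the
-- `if` evaluates b anyway, so b stays a thunk (still retaining f's list) until
-- the `if` demands it, i.e. after length/mySum have materialised xs -- two big
-- lists live at once.
gNoSeq :: Int -> Bool -> Int
gNoSeq n b = length xs + mySum xs + if b then 0 else 1
  where
    xs = [0..n]
```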
So every GC has to copy over the retained list. Here's an example invocation:
```
./simpl 10000000 +RTS -s
50000025000003
   1,440,058,976 bytes allocated in the heap
           8,368 bytes copied during GC
     497,755,080 bytes maximum residency (9 sample(s))
      85,244,984 bytes maximum slop
            1133 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       340 colls,     0 par    0.386s   0.386s     0.0011s    0.0038s
  Gen  1         9 colls,     0 par    0.903s   0.903s     0.1004s    0.3653s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.198s  (  0.196s elapsed)
  GC      time    1.289s  (  1.289s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time    1.488s  (  1.490s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    7,255,512,961 bytes per MUT second

  Productivity  13.3% of total user, 13.2% of total elapsed
```
As you can see, the majority of the time is spent in GC, major GCs in particular. The latter are what have to copy over the retained list. Note also the almost 500MB of maximum residency.

Btw., I think our "bytes copied during GC" counter might be broken (8,368 bytes copied is clearly implausible given the numbers above). Hence I'll show the output of GHC 9.0 from here on, where we get:
```
./simpl 10000000 +RTS -s
50000025000003
   1,440,058,840 bytes allocated in the heap
   2,370,752,112 bytes copied during GC
     660,038,856 bytes maximum residency (12 sample(s))
     113,040,184 bytes maximum slop
            1499 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1384 colls,     0 par    0.398s   0.399s     0.0003s    0.0009s
  Gen  1        12 colls,     0 par    1.145s   1.145s     0.0954s    0.4550s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.210s  (  0.209s elapsed)
  GC      time    1.543s  (  1.544s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.753s  (  1.753s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,868,178,975 bytes per MUT second

  Productivity  12.0% of total user, 11.9% of total elapsed
```
Note the "bytes copied during GC" number: it is huge, but much more plausible than the figure above, and it accounts for the intense GC pressure.
If I put a single bang on the first clause, `g 0 !_ = 0` (the full definition is sketched below, after the numbers), I trigger call-by-value (in CorePrep) in `f` and instead get the desired behaviour:
```
./simpl 10000000 +RTS -s
50000025000003
   1,440,058,816 bytes allocated in the heap
   1,970,756,096 bytes copied during GC
     385,938,608 bytes maximum residency (12 sample(s))
      66,104,144 bytes maximum slop
            1045 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1384 colls,     0 par    0.401s   0.402s     0.0003s    0.0009s
  Gen  1        12 colls,     0 par    0.841s   0.841s     0.0701s    0.2717s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.177s  (  0.176s elapsed)
  GC      time    1.242s  (  1.243s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.419s  (  1.419s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    8,150,089,956 bytes per MUT second

  Productivity  12.5% of total user, 12.4% of total elapsed
```
Note that residency almost halved. The same is reproducible on HEAD, where residency goes down to about 300MB.
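For reference, here is the variant with the extra bang (a sketch; the only change compared to the program above is the first clause of `g`):

```haskell
g :: Int -> Bool -> Int
g 0 !_ = 0   -- the extra bang: now every clause of g is strict in b,
             -- so the call in f is compiled as call-by-value in CorePrep
g n !b = length xs + mySum xs + if b then 0 else 1
  where
    xs = [0..n]
{-# NOINLINE g #-}
```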
## Diagnosis
It's pretty clear to me that we should not delay the eval of `b` in the original, lazy `g`: a `Bool` has constant space, whereas the thunk `even $ sum xs` retains a list, which clearly takes non-constant, input-dependent space. Even without knowing which thunk we evaluate, it cannot result in a smaller residency than a plain `Bool`.
That's what I mean by

> I mean, it is probably the very reason why inserting apparently redundant evals is so beneficial for !8148 (closed)!

because tag inference puts some of the evals back into the wrapper of `$wgo` (especially if the argument is trivial, which is not the case here, but I hope you see that I could arrange for it to trigger).
Of course you can find examples where we'd retain a list if we eval early but retain only an O(1) thunk when we delay (see the sketch after the list below). But in this example there are two independent, good reasons not to delay the eval:
- The programmer put a bang on `b`
- `b` has (non-recursive) type `Bool`
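Here is a made-up sketch (not from the program above) of the kind of case alluded to before the list, where the type is recursive and evaluating early is the wrong call: the thunk is O(1), but what it retains once forced to WHNF is not.

```haskell
-- Hypothetical illustration: `ys` as a thunk only points at xs (which is live
-- anyway), i.e. O(1) extra residency.  Forcing `reverse xs` to WHNF, however,
-- runs the whole reverse and materialises a second O(n) spine.  Evaluating ys
-- eagerly would keep that extra spine alive during the long `length xs`
-- traversal; delaying the eval keeps it alive only at the very end.
counterExample :: [Int] -> Int
counterExample xs
  | null xs   = 0
  | otherwise = length xs + last ys
  where
    ys = reverse xs
```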
I think this is somewhat similar to the debate about "which results/parameters are worth unboxing", which for now we settled with "don't unbox recursive DataCons" (#11545 (closed)). It makes sense to do the same for evals. We could experiment with that in `Note [Speculative evaluation]` and in the unliftedness worker/wrapper.
## Proposal
So let's try not to drop evals on non-recursive data types. We could also try to drop evals only if they happen immediately before a call to a strict function, because such evals get re-inserted in CorePrep anyway.
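To make that a bit more concrete, here is a very rough, purely hypothetical sketch of the decision (plain Haskell, not the real GHC API; `EvalSite` and the non-recursiveness flag are made-up stand-ins) that the Simplifier could consult before discarding an eval it knows will happen later anyway:

```haskell
-- Where the eval we are about to discard sits.
data EvalSite
  = BeforeStrictCall  -- the evaluated value flows straight into a call that is
                      -- strict in it; CorePrep re-inserts the eval there anyway
  | Elsewhere         -- anything else, e.g. the dropped seq on b in g above

-- Proposal, roughly: only discard the eval when CorePrep will put it back,
-- or when the scrutinee's type is recursive (so the evaluated value could be
-- arbitrarily large and keeping the thunk might even be the better choice).
mayDiscardEval
  :: EvalSite
  -> Bool      -- ^ does the scrutinee have a non-recursive type (e.g. Bool)?
  -> Bool
mayDiscardEval BeforeStrictCall _              = True
mayDiscardEval Elsewhere        nonRecursiveTy = not nonRecursiveTy
```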