The Simplifier's deferring of evals can cause space leaks

Problem

I trying to prove a point in #21497 (comment 436914), I came up with the following program and decided to continue the discussion in a separate thread

{-# LANGUAGE BangPatterns #-}

import System.Environment

f :: [Int] -> Int
f xs = g (length xs) (even $ mySum xs)
{-# NOINLINE f #-}

g :: Int -> Bool -> Int
g 0 _ = 0
g n !b = length xs + mySum xs + if b then 0 else 1
  where
    xs = [0..n]
{-# NOINLINE g #-}

mySum :: [Int] -> Int
mySum = go 0
  where
    go acc (x:xs) = go (x+acc) xs
    go acc _      = acc

main = do
  (n:_) <- map read <$> getArgs
  print $ f [0..n]

The key here is

length forces the spine of xs, potentially a very big list
The thunk sum xs retains that list
This pattern occurs twice in the program, once in f and once in g.
g has a bang on b, so the programmer really wants to force the even $ mySum xs thunk before doing the next long and allocation-heavy length and then mySum operation, so that at most one big list is live at any point

But the Simplifier (on master as well as 9.0) discards the seq because the if will eval it later on, so we'll get overlapping lifetimes for both big lists (keep an eye out for ds_s2TY):

-- RHS size: {terms: 49, types: 14, coercions: 0, joins: 0/1}
$wg_r2UH :: GHC.Prim.Int# -> Bool -> GHC.Prim.Int#
[GblId, Arity=2, Str=<1L><ML>, Unf=OtherCon []]
$wg_r2UH
  = \ (ww_s2TW :: GHC.Prim.Int#) (ds_s2TY :: Bool) ->
      case ww_s2TW of ds1_X1 {
        __DEFAULT ->
          case GHC.Prim.># 0# ds1_X1 of {
            __DEFAULT ->
              letrec {
                go3_a2T6 [Occ=LoopBreaker, Dmd=SCS(L)] :: GHC.Prim.Int# -> [Int]
                [LclId, Arity=1, Str=<L>, Unf=OtherCon []]
                go3_a2T6 = ... in
              case $wgo_r2UG 0# (go3_a2T6 0#) of ww1_s2Ua { __DEFAULT ->
              case ds_s2TY of {
                False -> GHC.Prim.+# ww1_s2Ua 1#;
                True -> ww1_s2Ua
              }
              };
            1# ->
              case ds_s2TY of {
                False -> 1#;
                True -> 0#
              }
          };
        0# -> 0#
      }

So every GC has to copy over the the retained list. Here's an example invocation:

./simpl 10000000 +RTS -s
50000025000003
   1,440,058,976 bytes allocated in the heap
           8,368 bytes copied during GC
     497,755,080 bytes maximum residency (9 sample(s))
      85,244,984 bytes maximum slop
            1133 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       340 colls,     0 par    0.386s   0.386s     0.0011s    0.0038s
  Gen  1         9 colls,     0 par    0.903s   0.903s     0.1004s    0.3653s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.198s  (  0.196s elapsed)
  GC      time    1.289s  (  1.289s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time    1.488s  (  1.490s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    7,255,512,961 bytes per MUT second

  Productivity  13.3% of total user, 13.2% of total elapsed

As you can see, the majority of time is spent in GC, major GCs in particular. The latter are necessary to copy over the retained list. Note also almost 500MB of max residency.

Btw., I think our "bytes copied during GC" counter might be broken. Hence I'll continue to show the output of GHC 9.0, where we'll get

./simpl 10000000 +RTS -s
50000025000003
   1,440,058,840 bytes allocated in the heap
   2,370,752,112 bytes copied during GC
     660,038,856 bytes maximum residency (12 sample(s))
     113,040,184 bytes maximum slop
            1499 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1384 colls,     0 par    0.398s   0.399s     0.0003s    0.0009s
  Gen  1        12 colls,     0 par    1.145s   1.145s     0.0954s    0.4550s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.210s  (  0.209s elapsed)
  GC      time    1.543s  (  1.544s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.753s  (  1.753s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,868,178,975 bytes per MUT second

  Productivity  12.0% of total user, 11.9% of total elapsed

Note the excessive copied during GC number, which seems much more reasonable and accounts for the intense GC pressure.

If I put a single bang on g 0 !_ = 0, I trigger call-by-value (in CorePrep) in f and instead get the desired

./simpl 10000000 +RTS -s
50000025000003
   1,440,058,816 bytes allocated in the heap
   1,970,756,096 bytes copied during GC
     385,938,608 bytes maximum residency (12 sample(s))
      66,104,144 bytes maximum slop
            1045 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1384 colls,     0 par    0.401s   0.402s     0.0003s    0.0009s
  Gen  1        12 colls,     0 par    0.841s   0.841s     0.0701s    0.2717s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.177s  (  0.176s elapsed)
  GC      time    1.242s  (  1.243s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.419s  (  1.419s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    8,150,089,956 bytes per MUT second

  Productivity  12.5% of total user, 12.4% of total elapsed

Note that residency almost halved. The same is reproducible on HEAD, where residency goes down to about 300MB.

Diagnosis

It's pretty clear to me that we should not delay the eval of b in the original, lazy g: A Bool has constant space whereas the thunk even $ sum xs retains a list which clearly takes non-constant/input-dependent space. Even without knowing what thunk we evaluate, it can't result in a smaller residency than a simple Bool.

That's what I mean with

I mean, it is probably the very reason why inserting apparently redundant evals is so beneficial for !8148 (closed)!

Because tag inference puts some of the evals back into the wrapper of $wgo (especially if the argument is trivial, which is not the case here, but I hope you see that I could arrange for it to trigger).

Of course you can find examples where we'd retain a list if we eval early but retain only an O(1) thunk when we delay. But in this example there are two independent, good reasons not to delay the eval:

The programmer put a bang on b
b has (non-recursive) type Bool

I think this is somewhat similar to the debate of "which results/parameters are worth unboxing" which we concluded with "don't unbox recursive DataCons" (#11545 (closed)) for now. I think it makes sense to do the same for evals. We could experiment around with that in Note [Speculative evaluation] and unliftedness worker/wrapper.

Proposal

So let's try not to drop evals on non-recursive data types. We could also try to drop evals only if they happen immediately before a call to a strict function, because these evals get re-inserted back in CorePrep.

Edited Jan 23, 2024 by Simon Peyton Jones

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information