# Need better absence analysis
Here's an example where absence analysis is really failing.
```haskell
data K a = Nil | Kons !a !(K a)
data T = MkT ![Int] !(K Int)

f1 :: T -> Int -> [Int]
-- Str=<1!P(1L,A)><1!P(1L)>,
f1 (MkT xs ys) 0 = xs
f1 (MkT xs ys) n = f1 (MkT xs (Kons n ys)) (n-1)

f2 :: [Int] -> K Int -> Int -> [Int]
-- The bang makes things much worse
-- Str=<1L><1A><1!P(1L)>,
f2 xs !ys 0 = xs
f2 xs ys n = f2 xs (Kons n ys) (n-1)
```
In `f1`, the `ys` argument to `MkT` is never used. Absence analysis discovers that, and we get a tight loop that never passes that accumulating `Kons` list:
```
Foo.$wf1 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~])],
 Arity=2,
 Str=<1L><1L>,
 Unf=OtherCon []]
Foo.$wf1
  = \ (ww_sMw :: [Int]) (ww1_sMB :: GHC.Prim.Int#) ->
      case ww1_sMB of ds_X3 {
        __DEFAULT -> Foo.$wf1 ww_sMw (GHC.Prim.-# ds_X3 1#);
        0# -> ww_sMw
      }
```
But in `f2` I have done a kind of manual worker/wrapper and unboxed the `MkT`. I have also added a bang to `ys`. I get much worse code:
```
Foo.$wf2 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> K Int -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~, ~])],
 Arity=3,
 Str=<1L><MA><1L>,
 Unf=OtherCon []]
Foo.$wf2
  = \ (xs_sMl :: [Int]) (ys_sMm :: K Int) (ww_sMp :: GHC.Prim.Int#) ->
      case ww_sMp of ds_X2 {
        __DEFAULT ->
          case ys_sMm of ys1_X3 { __DEFAULT ->
          Foo.$wf2
            xs_sMl
            (Foo.Kons @Int (GHC.Types.I# ds_X2) ys1_X3)
            (GHC.Prim.-# ds_X2 1#)
          };
        0# -> xs_sMl
      }
```
Look at that ever-increasing `Kons` argument that is eventually thrown away anyway.
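The key point is semantic: a bang pattern desugars to `seq`, and a `seq` counts as a *use* of the argument, so absence analysis can no longer mark it absent. A minimal standalone sketch of this (hypothetical names `g1`/`g2`, not from the ticket; both compute the same result, only allocation behaviour differs):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

data K a = Nil | Kons !a !(K a)

-- No bang: ys is never demanded, so absence analysis
-- can discard it entirely (the f1-like case).
g1 :: [Int] -> K Int -> Int -> [Int]
g1 xs _ys 0 = xs
g1 xs ys  n = g1 xs (Kons n ys) (n - 1)

-- With the bang: `!ys` is `ys `seq` ...`, a use of ys, so the
-- ever-growing Kons chain must actually be allocated and forced
-- (the f2-like case).
g2 :: [Int] -> K Int -> Int -> [Int]
g2 xs !ys 0 = xs
g2 xs ys  n = g2 xs (Kons n ys) (n - 1)

main :: IO ()
main =
  -- Both return the untouched first argument; the accumulator is dead.
  print (g1 [1, 2, 3] Nil 1000 == g2 [1, 2, 3] Nil 1000)
```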
## How this happened
This happened in the wild: see here. The observed behavior is:

- With a SPECIALISE pragma in module `MnistFcnnScalar.hs`, the program is fast. Indirectly we create a specialised version of the overloaded function `sumConstantData` in that module.
- Remove the SPECIALISE, and add `-fexpose-all-unfoldings -fspecialise-aggressively`. Now module `BenchMnistTools.hs` needs a specialised version of `sumConstantData`, and creates one from the exposed unfolding.
- But the program runs 30% slower, and allocates 15x as much. Why?
- It turns out that the optimised unfolding of the (overloaded) `sumConstantData`, in `MnistFcnnScalar.hi`, has a local loop `$wfoldlM'_loop`, which arose from this definition in `vector:Data.Vector.Fusion.Stream.Monadic`:

  ```haskell
  foldlM' :: Monad m => (a -> b -> m a) -> a -> Stream m b -> m a
  {-# INLINE_FUSED foldlM' #-}
  foldlM' m w (Stream step t) = foldlM'_loop SPEC w t
    where
      foldlM'_loop !_ z s
        = z `seq` do
            r <- step s
            case r of
              Yield x s' -> do { z' <- m z x; foldlM'_loop SPEC z' s' }
              Skip    s' -> foldlM'_loop SPEC z s'
              Done       -> return z
  ```

  Note the `seq`!
- When we specialise the (optimised) unfolding for `sumConstantData` we are seeing a loop that looks much like `f2`.
- But when we specialise the un-optimised original definition of `sumConstantData` (when the SPECIALISE pragma is there) we have not yet inlined `foldlM'`, so we don't see that loop.
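The `foldlM'` situation can be boiled down to a miniature (hypothetical names, pure rather than monadic, for illustration only): the `z `seq`` in the loop plays exactly the role of `!ys` above, turning an otherwise dead accumulator into a used one.

```haskell
module Main where

data K a = Nil | Kons !a !(K a)

-- Lazy accumulator: acc is dead, absence analysis can discard it
-- (the f1-like case).
loopLazy :: K Int -> Int -> Int
loopLazy _acc 0 = 0
loopLazy acc  n = loopLazy (Kons n acc) (n - 1)

-- Strict accumulator in the style of foldlM'_loop's `z `seq` ...`:
-- the seq is a use of acc, so the Kons chain must be built
-- (the f2-like case).
loopStrict :: K Int -> Int -> Int
loopStrict _acc 0 = 0
loopStrict acc  n = acc `seq` loopStrict (Kons n acc) (n - 1)

main :: IO ()
main = print (loopLazy Nil 100000, loopStrict Nil 100000)
```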
So roughly, we get `f2` instead of `f1`, and that makes a really big difference. What is worse, the difference happens when we see a "more optimised" program. The programmer has zero chance of working out what is going on.