# Need better absence analysis
Here's an example where absence analysis is really failing.
```haskell
data K a = Nil | Kons !a !(K a)
data T = MkT ![Int] !(K Int)

f1 :: T -> Int -> [Int]
-- Str=<1!P(1L,A)><1!P(1L)>,
f1 (MkT xs ys) 0 = xs
f1 (MkT xs ys) n = f1 (MkT xs (Kons n ys)) (n-1)

f2 :: [Int] -> K Int -> Int -> [Int]
-- The bang makes things much worse
-- Str=<1L><1A><1!P(1L)>,
f2 xs !ys 0 = xs
f2 xs ys n = f2 xs (Kons n ys) (n-1)
```
In `f1`, the `ys` argument to `MkT` is never used. Absence analysis discovers that, and we get a tight loop that never passes that accumulating `Kons` list:
```
Foo.$wf1 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~])],
 Arity=2,
 Str=<1L><1L>,
 Unf=OtherCon []]
Foo.$wf1
  = \ (ww_sMw :: [Int]) (ww1_sMB :: GHC.Prim.Int#) ->
      case ww1_sMB of ds_X3 {
        __DEFAULT -> Foo.$wf1 ww_sMw (GHC.Prim.-# ds_X3 1#);
        0# -> ww_sMw
      }
```
But in `f2` I have done a kind of manual worker/wrapper and unboxed the `MkT`. I have also added a bang to `ys`. I get much worse code:
```
Foo.$wf2 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> K Int -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~, ~])],
 Arity=3,
 Str=<1L><MA><1L>,
 Unf=OtherCon []]
Foo.$wf2
  = \ (xs_sMl :: [Int]) (ys_sMm :: K Int) (ww_sMp :: GHC.Prim.Int#) ->
      case ww_sMp of ds_X2 {
        __DEFAULT ->
          case ys_sMm of ys1_X3 { __DEFAULT ->
          Foo.$wf2
            xs_sMl
            (Foo.Kons @Int (GHC.Types.I# ds_X2) ys1_X3)
            (GHC.Prim.-# ds_X2 1#)
          };
        0# -> xs_sMl
      }
```
Look at that ever-increasing `Kons` argument that is eventually thrown away anyway.
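The key point is semantic: a bang pattern desugars to `seq`, and a `seq` counts as a *use* of the argument, so absence analysis can no longer mark it absent. A minimal standalone sketch of this (hypothetical names `g1`/`g2`, not from the ticket; both compute the same result, only allocation behaviour differs):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

data K a = Nil | Kons !a !(K a)

-- No bang: ys is never demanded, so absence analysis
-- can discard it entirely (the f1-like case).
g1 :: [Int] -> K Int -> Int -> [Int]
g1 xs _ys 0 = xs
g1 xs ys  n = g1 xs (Kons n ys) (n - 1)

-- With the bang: `!ys` is `ys `seq` ...`, a use of ys, so the
-- ever-growing Kons chain must actually be allocated and forced
-- (the f2-like case).
g2 :: [Int] -> K Int -> Int -> [Int]
g2 xs !ys 0 = xs
g2 xs ys  n = g2 xs (Kons n ys) (n - 1)

main :: IO ()
main =
  -- Both return the untouched first argument; the accumulator is dead.
  print (g1 [1, 2, 3] Nil 1000 == g2 [1, 2, 3] Nil 1000)
```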
## How this happened
This happened in the wild: see here. The observed behavior is:

- With a SPECIALISE pragma in module `MnistFcnnScalar.hs`, the program is fast. Indirectly we create a specialised version of the overloaded function `sumConstantData` in that module.
- Remove the SPECIALISE, and add `-fexpose-all-unfoldings -fspecialise-aggressively`. Now module `BenchMnistTools.hs` needs a specialised version of `sumConstantData`, and creates one from the exposed unfolding.
- But the program runs 30% slower, and allocates 15x as much. Why?
- It turns out that the optimised unfolding of the (overloaded) `sumConstantData`, in `MnistFcnnScalar.hi`, has a local loop `$wfoldlM'_loop`, which arose from this definition in `vector:Data.Vector.Fusion.Stream.Monadic`:

  ```haskell
  foldlM' :: Monad m => (a -> b -> m a) -> a -> Stream m b -> m a
  {-# INLINE_FUSED foldlM' #-}
  foldlM' m w (Stream step t) = foldlM'_loop SPEC w t
    where
      foldlM'_loop !_ z s
        = z `seq` do
            r <- step s
            case r of
              Yield x s' -> do { z' <- m z x; foldlM'_loop SPEC z' s' }
              Skip    s' -> foldlM'_loop SPEC z s'
              Done       -> return z
  ```

  Note the `seq`!
- When we specialise the (optimised) unfolding for `sumConstantData` we are seeing a loop that looks much like `f2`.
- But when we specialise the un-optimised original definition of `sumConstantData` (when the SPECIALISE pragma is there) we have not yet inlined `foldlM'`, so we don't see that loop.
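The `foldlM'` situation can be boiled down to a miniature (hypothetical names, pure rather than monadic, for illustration only): the `z `seq`` in the loop plays exactly the role of `!ys` above, turning an otherwise dead accumulator into a used one.

```haskell
module Main where

data K a = Nil | Kons !a !(K a)

-- Lazy accumulator: acc is dead, absence analysis can discard it
-- (the f1-like case).
loopLazy :: K Int -> Int -> Int
loopLazy _acc 0 = 0
loopLazy acc  n = loopLazy (Kons n acc) (n - 1)

-- Strict accumulator in the style of foldlM'_loop's `z `seq` ...`:
-- the seq is a use of acc, so the Kons chain must be built
-- (the f2-like case).
loopStrict :: K Int -> Int -> Int
loopStrict _acc 0 = 0
loopStrict acc  n = acc `seq` loopStrict (Kons n acc) (n - 1)

main :: IO ()
main = print (loopLazy Nil 100000, loopStrict Nil 100000)
```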
So roughly, we get `f2` instead of `f1`, and that makes a really big difference. What is worse, the difference happens when we see a "more optimised" program. The programmer has zero chance of working out what is going on.