# Need better absence analysis
Here's an example where absence analysis is really failing.
```haskell
data K a = Nil | Kons !a !(K a)
data T = MkT ![Int] !(K Int)

f1 :: T -> Int -> [Int]
-- Str=<1!P(1L,A)><1!P(1L)>,
f1 (MkT xs ys) 0 = xs
f1 (MkT xs ys) n = f1 (MkT xs (Kons n ys)) (n-1)

f2 :: [Int] -> K Int -> Int -> [Int]
-- The bang makes things much worse
-- Str=<1L><1A><1!P(1L)>,
f2 xs !ys 0 = xs
f2 xs ys n = f2 xs (Kons n ys) (n-1)
```
In `f1`, the `ys` argument to `MkT` is never used. Absence analysis discovers that,
and we get a tight loop that never passes that accumulating `Kons` list.
```
Foo.$wf1 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~])],
 Arity=2,
 Str=<1L><1L>,
 Unf=OtherCon []]
Foo.$wf1
  = \ (ww_sMw :: [Int])
      (ww1_sMB :: GHC.Prim.Int#) ->
      case ww1_sMB of ds_X3 {
        __DEFAULT -> Foo.$wf1 ww_sMw (GHC.Prim.-# ds_X3 1#);
        0# -> ww_sMw
      }
```
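The Core above corresponds roughly to this source-level loop — a hedged rendering (the name `f1Worker` is mine, not GHC's):

```haskell
-- Illustrative source-level equivalent of Foo.$wf1: the (K Int) field
-- has been dropped entirely, so no Kons chain is ever built.
f1Worker :: [Int] -> Int -> [Int]
f1Worker xs 0 = xs
f1Worker xs n = f1Worker xs (n - 1)
```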
But in `f2` I have done a kind of manual worker/wrapper and unboxed the `MkT`. I have also added a bang to `ys`. I get much worse code:
```
Foo.$wf2 [InlPrag=[2], Occ=LoopBreaker]
  :: [Int] -> K Int -> GHC.Prim.Int# -> [Int]
[GblId[StrictWorker([!, ~, ~])],
 Arity=3,
 Str=<1L><MA><1L>,
 Unf=OtherCon []]
Foo.$wf2
  = \ (xs_sMl :: [Int])
      (ys_sMm :: K Int)
      (ww_sMp :: GHC.Prim.Int#) ->
      case ww_sMp of ds_X2 {
        __DEFAULT ->
          case ys_sMm of ys1_X3 { __DEFAULT ->
          Foo.$wf2
            xs_sMl
            (Foo.Kons @Int (GHC.Types.I# ds_X2) ys1_X3)
            (GHC.Prim.-# ds_X2 1#)
          };
        0# -> xs_sMl
      }
```
Look at that ever-increasing `Kons` argument that is eventually thrown away anyway.
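For comparison, dropping the bang on `ys` should let absence analysis fire again — a hedged sketch (the name `f3` is mine, and I have only checked its semantics here, not its generated Core):

```haskell
-- f3 is f2 with the bang on ys removed; with no bang/seq "use" of ys,
-- absence analysis should again be free to discard the dead Kons chain.
data K a = Nil | Kons !a !(K a)

f3 :: [Int] -> K Int -> Int -> [Int]
f3 xs _  0 = xs
f3 xs ys n = f3 xs (Kons n ys) (n - 1)
```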
## How this happened
This happened in the wild: see here. The observed behavior is:
- With a SPECIALISE pragma in module `MnistFcnnScalar.hs`, the program is fast. Indirectly we create a specialised version of the overloaded function `sumConstantData` in that module.
- Remove the SPECIALISE, add `-fexpose-all-unfoldings -fspecialise-aggressively`. Now module `BenchMnistTools.hs` needs a specialised version of `sumConstantData`, and creates one from the exposed unfolding.
- But the program runs 30% slower, and allocates 15x as much. Why?
- It turns out that the optimised unfolding of the (overloaded) `sumConstantData`, in `MnistFcnnScalar.hi`, has a local loop `$wfoldlM'_loop` which arose from this defn in `vector:Data.Vector.Fusion.Stream.Monadic`:

  ```haskell
  foldlM' :: Monad m => (a -> b -> m a) -> a -> Stream m b -> m a
  {-# INLINE_FUSED foldlM' #-}
  foldlM' m w (Stream step t) = foldlM'_loop SPEC w t
    where
      foldlM'_loop !_ z s
        = z `seq` do
            r <- step s
            case r of
              Yield x s' -> do { z' <- m z x; foldlM'_loop SPEC z' s' }
              Skip    s' -> foldlM'_loop SPEC z s'
              Done       -> return z
  ```

  Notice that `seq`!
- When we specialise the (optimised) unfolding for `sumConstantData` we are seeing a loop that looks much like `f2`.
- But when we specialise the un-optimised original defn of `sumConstantData` (when the SPECIALISE pragma is there), we have not yet inlined `foldlM'`, so we don't see that loop.
So roughly, we get `f2` instead of `f1`, and that makes a really big difference. What is worse,
the difference happens when we see a "more optimised" program. The programmer has zero chance of working out what is going on.
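For anyone wanting to poke at this locally, the two loops can be packaged as one self-contained module — this is just the code from the top of the report, re-assembled (compile with `-O -ddump-simpl` to inspect the workers):

```haskell
{-# LANGUAGE BangPatterns #-}

data K a = Nil | Kons !a !(K a)
data T = MkT ![Int] !(K Int)

-- Good: ys is absent, so the worker drops the accumulating Kons chain.
f1 :: T -> Int -> [Int]
f1 (MkT xs ys) 0 = xs
f1 (MkT xs ys) n = f1 (MkT xs (Kons n ys)) (n - 1)

-- Bad: the bang counts as a use of ys, defeating absence analysis.
f2 :: [Int] -> K Int -> Int -> [Int]
f2 xs !ys 0 = xs
f2 xs ys n = f2 xs (Kons n ys) (n - 1)
```

Both return the same result; only the compiled loops (and their allocation) differ.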