SpecConstr rule base woes
Consider the following stream fusion approach harnessing the type class specialiser (and also constructor-pattern specialisation for subsequently specialising loop
):
{-# LANGUAGE TypeFamilies, FlexibleContexts, BangPatterns #-}
data Step iter
= Yield !iter !(Item iter)
| Skip !iter
| Done
class Iterator iter where
type Item iter
next :: iter -> Step iter
data Singleton a = Empty' | Singleton !a
singletonS :: a -> Singleton a
singletonS = Singleton
instance Iterator (Singleton a) where
type Item (Singleton a) = a
next Empty' = Done
next (Singleton a) = Yield Empty' a
data EnumFromTo a = EnumFromTo !a !a
enumFromToS :: a -> a -> EnumFromTo a
enumFromToS = EnumFromTo
instance (Ord a, Num a) => Iterator (EnumFromTo a) where
type Item (EnumFromTo a) = a
next (EnumFromTo i high)
| i > high = Done
| otherwise = Yield (EnumFromTo (i + 1) high) i
sumS :: (Num (Item iter), Iterator iter) => iter -> Item iter
sumS = loop 0
where
loop !total iter1 =
case next iter1 of
Done -> total
Skip iter2 -> loop total iter2
Yield iter2 x -> loop (total + x) iter2
-- {-# NOINLINE[99] loop #-} -- comment in for 10x speedup
{-# INLINE sumS #-}
data ConcatMap i o = ConcatMap !(Maybe o) !(Item i -> o) !i
concatMapS :: (Item i -> o) -> i -> ConcatMap i o
concatMapS = ConcatMap Nothing
instance (Iterator i, Iterator o) => Iterator (ConcatMap i o) where
type Item (ConcatMap i o) = Item o
next (ConcatMap Nothing f i) = case next i of
Skip i' -> Skip (ConcatMap Nothing f i')
Yield i' x -> Skip (ConcatMap (Just (f x)) f i')
Done -> Done
next (ConcatMap (Just o) f i) = case next o of
Skip o' -> Skip (ConcatMap (Just o') f i)
Yield o' x -> Yield (ConcatMap (Just o') f i) x
Done -> Skip (ConcatMap Nothing f i)
ex1 :: Int -> Int
ex1
= sumS
. concatMapS (\a -> singletonS (a*a))
. enumFromToS 1
{-# NOINLINE ex1 #-}
main = print (ex1 99999999)
If the {-# NOINLINE[99] loop #-}
pragma is commented in, ex1
will compile down to this perfect bit of code:
Main.main_$s$wloop
= \ (sc_s4GJ :: GHC.Prim.Int#)
(sc1_s4GI :: GHC.Prim.Int#)
(sc2_s4GH :: GHC.Prim.Int#) ->
case GHC.Prim.># sc1_s4GI sc_s4GJ of {
__DEFAULT ->
Main.main_$s$wloop
sc_s4GJ
(GHC.Prim.+# sc1_s4GI 1#)
(GHC.Prim.+# sc2_s4GH (GHC.Prim.*# sc1_s4GI sc1_s4GI));
1# -> sc2_s4GH
}
Beautiful!
Without the NOINLINE
pragma, not so much, though. Here's the Core:
Main.main_$s$wloop [Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.Int#
-> (Item (EnumFromTo Int) -> Singleton Int)
-> Singleton Int
-> GHC.Prim.Int#
-> GHC.Prim.Int#
[GblId,
Arity=5,
Caf=NoCafRefs,
Str=<S,U><S,U><L,C(U)><S,1*U><S,U>,
Unf=OtherCon []]
Main.main_$s$wloop
= \ (sc_s4FQ :: GHC.Prim.Int#)
(sc1_s4FP :: GHC.Prim.Int#)
(sc2_s4FO :: Item (EnumFromTo Int) -> Singleton Int)
(sc3_s4FN :: Singleton Int)
(sc4_s4FM :: GHC.Prim.Int#) ->
case sc3_s4FN of {
Empty' ->
case GHC.Prim.># sc1_s4FP sc_s4FQ of {
__DEFAULT ->
Main.main_$s$wloop
sc_s4FQ
(GHC.Prim.+# sc1_s4FP 1#)
sc2_s4FO
(sc2_s4FO
((GHC.Types.I# sc1_s4FP)
`cast` (Sub (Sym (Main.D:R:ItemEnumFromTo[0] <Int>_N))
:: Int ~R# Item (EnumFromTo Int))))
sc4_s4FM;
1# -> sc4_s4FM
};
Singleton a_a1z4 ->
case a_a1z4 of { GHC.Types.I# y_s4Dg ->
Main.main_$s$wloop
sc_s4FQ
sc1_s4FP
sc2_s4FO
(Main.Empty' @ Int)
(GHC.Prim.+# sc4_s4FM y_s4Dg)
}
}
Long story short, constructor specialisation failed to specialise $wloop
for some higher-order arguments (the function of type Item (EnumFromTo Int) -> Singleton Int
in particular). That results in a slow-down of a factor of 10 and it goes from constant space to allocating huge amounts of memory (still constant residency, though).
Why?
Edit: Don't bother reading this section yet, it's incoherent and inaccurate. Suffice it to say, the conclusion is accurate: Even though we specialise for the stronger call pattern, we don't rewrite to it because the rewrite rule is not attached to the strictly weaker specialisation.
Looking at the difference in -dverbose-core2core
, the first real divergence is in the output from desugaring after optimisation, where the inner loop
is specialised for the type arguments and dictionaries of the outer sumS
function in the version without NOINLINE
.
Then after the initial simplifier phase we get the first digression because loop
was specialised for the particular types it was called at, whereas it wasn't in the NOINLINE
version (which makes sense).
Now, the critical moment is when loop
becomes too large to inline into ex1
in the version without NOINLINE
. But ex1
calls loop
with quite interesting arguments, those we want to specialise for (the particular ConcatMap
datum in particular)! In code:
f :: Int -> Singleton Int -- same function for both
f ...
inner :: Int -> EnumFromToInt Int -- the input stream from which ConcatMap draws elements
inner ...
-- without NOINLINE
loop (ConcatMap x f inner) = HUGE and recursive
ex1 n = loop (ConcatMap Nothing f (inner n))
-- with NOINLINE[99], loop gets inlined after the initial phase
ex1 n = joinrec loop' (ConcatMap x f (inner n)) = <HUGE and recursive> in loop' (ConcatMap Nothing f (inner n))
Then WW kicks in. loop
(and loop'
) is deeply strict in the ConcatMap
we pass it, but it refuses to 'unpack' (i.e. specialise) for ConcatMap
's f
. Note also that it will WW loop
, ex1
and loop'
:
-- without NOINLINE
$wloop x f <unpacked components from inner> = HUGE and recursive
loop (ConcatMap x f (.. unpack components from inner ...)) = $wloop x f <unpacked components from inner>
ex1 n = $wloop (ConcatMap Nothing f (inner n))
-- with NOINLINE
$wex1 n = joinrec $wloop' (ConcatMap x f inner) = <HUGE and recursive> in $wloop' (ConcatMap Nothing f (inner n))
ex1 (I# n) = $wex1 n
Note how in the NOINLINE case this separated the definition of f
from its usage by a lambda binder, so the subsequent SpecConstr pass is unable to figure out what to specialise for. It can't look through $fw
. I think this is the same or at least a similar effect that @nomeata experienced in #14068 and accounted for in #14844.
Or at least that's my working hypothesis, I'll edit this post accordingly tomorrow when I had time to investigate.