Bad monads performance
Hi! We've been struggling with a very strange GHC behavior on IRC today. Let's consider the following code (needs mtl and criterion to be compiled):
module Main where
import Prelude
import Criterion.Main
import qualified Control.Monad.State.Strict as Strict
import qualified Control.Monad.State.Class as State
import Control.DeepSeq (NFData, rnf, force)
import GHC.IO (evaluate)
import Data.Monoid
-----------------------------
-- === Criterion utils === --
-----------------------------
eval :: NFData a => a -> IO a
eval = evaluate . force ; {-# INLINE eval #-}
liftExp :: (Int -> a) -> (Int -> a)
liftExp f = f . (10^) ; {-# INLINE liftExp #-}
expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a)
expCodeGen f i = do
putStrLn $ "generating input code (10e" <> show i <> " chars)"
out <- eval $ liftExp f i
putStrLn "code generated sucessfully"
return out
{-# INLINE expCodeGen #-}
expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int -> Benchmark
expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) . nf p ; {-# INLINE expCodeGenBench #-}
-------------------------------
-- === (a*) list parsing === --
-------------------------------
genList_a :: Int -> [Char]
genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}
pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
'a':s -> pureListParser_a s
[] -> True
_ -> False
{-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a #-}
mtlStateListParser_a_typed :: Strict.State [Char] Bool
mtlStateListParser_a_typed = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a_typed
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_typed #-}
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
go = Strict.get >>= \case
'a':s -> Strict.put s >> go
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_let #-}
{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}
{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}
main = do
defaultMain
[ bgroup "a*" $
[ bgroup "pure" $ expCodeGenBench genList_a pureListParser_a <$> [6..6]
, bgroup "mtl.State.Strict" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a) <$> [6..6]
, bgroup "mtl.State.Strict typed" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6]
, bgroup "mtl.State.Strict let" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_let) <$> [6..6]
]
]
The code was compiled with following options (and many other variations): -threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-factor=1000 -flate-dmd-anal
Everything in this code has INLINE pragma. The important part we should focus on are these two functions:
pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
'a':s -> pureListParser_a s
[] -> True
_ -> False
{-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a #-}
Which are just "parsers" accepting strings containing only 'a' characters. The former is pure one, while the later uses State to keep the remaining input. The following list contains performance related observations:
- For the rest of the points, let's call the performance of
pureListParser_aa "good" one and everything worse a "bad" one. - The performance of
mtlStateListParser_ais bad, it runs 10 times slower thanpureListParser_a. Inspecting CORE we can observe that GHC jumps between(# a,b #)and(a,b)representations all the time. - If we add a specialize pragma
{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}, the performance ofmtlStateListParser_ais good (exactly the same aspureListParser_a). - If we do NOT use specialize pragma, but we use explicite, non-polymorphic type signature
mtlStateListParser_a_typed :: Strict.State [Char] Bool, the performance is bad (!), identical to the polymorphic version without specialization. - If we use SPECIALIZE pragma together with explicite, non-polymorphic type, so we use BOTH
mtlStateListParser_a_typed :: Strict.State [Char] BoolAND{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}we get the good performance. - If we transform
pureListParser_ato
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
go = Strict.get >>= \case
'a':s -> Strict.put s >> go
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_let #-}
we again get the good performance without the need to use SPECIALIZE pragmas.
- The performance of all the functions that are not optimized as good as
pureListParser_ais a lot worse in GHC 8.2.1-rc3 than in 8.0.2. - The not-yet documented flag
-fspecialise-aggressivelydoes NOT affect the results (https://ghc.haskell.org/trac/ghc/ticket/12463). - If you do NOT use
INLINEpragma on functionsmtlStateListParser_aandmtlStateListParser_a_typedtheir performance is good (soINLINEpragma makes it bad until we provide explicit specialization). Moreover, if we useINLINABLEpragma instead ofINLINEon these functions (which logically makes more sense, because they are recursive), performance of the polymorphic onemtlStateListParser_ais good, while performance of the explicitly typedmtlStateListParser_a_typedis bad until we provide explicite specialization.
The above points raise the following questions:
- Why GHC does not optimize
mtlStateListParser_athe same way aspureListParser_aand where the jumping between(# a,b #)and(a,b)comes from? - Is there any way to tell GHC to automatically insert
SPECIALIZEpragmas, especially in performance critical code? - Why providing very-explicite type signature
mtlStateListParser_a_typed :: Strict.State [Char] Booldoes not solve the problem unless we useSPECIALIZEpragma that tells the same as the signature? (GHC even warns:SPECIALISE pragma for non-overloaded function ‘mtlStateListParser_a_typed’but it affects performance.) - Why the trick to alias the body of recursive function to a local variable
goaffects the performance in any way, especially when it does NOT bring any variable to the local let scope?
We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several people reported exactly the same observations.