Weird performance results.
Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results.
- *The code is compiled with
-XStrict
enabled globally. The full source code for this ticket is attached, while the exposed code below uses...
to hide some non-important implementations.**
To get desired results, we use following GHC flags: -O2 -funfolding-use-threshold=10000
.
Let's consider the following program. It is just a pseudo-parser implementation. It consumes 'a' chars in a loop and fails on empty input in the end:
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- | WARNING: -XStrict enabled in this file !!!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Main where
imports ... (full source attached to this ticket)
------------------------
-- === Primitives === --
------------------------
-- === Strict Either === --
data Either e a = Left e | Right a deriving (Eq, Generic, Ord, Read, Show, Functor)
newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) }
instance Monad m => Functor (EitherT e m) where ...
instance Monad m => Applicative (EitherT e m) where ...
instance Monad m => Monad (EitherT e m) where ...
-- === Strict Bool === --
data XBool = XTrue | XFalse deriving (Show, Generic)
(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
XTrue -> a
XFalse -> b
{-# INLINE (|||) #-}
-- === Strict Tuple === --
data T a b = T !a !b deriving (Generic, Show, Functor)
------------------------
-- === FailParser === --
------------------------
-- === Definition === --
-- | It is just like EitherT, but also contains progress indicator - a field of type XBool
-- which tells us if we've already parsed a char or not yet. In this snippet code however,
-- it does not do anything valuable - it just stores the value.
newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor)
instance Monad m => Applicative (FailParser m) where
pure = undefined
(<*>) = undefined
instance Monad m => Monad (FailParser m) where
return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
FailParser ma >>= f = FailParser $ do
T !b !a <- ma
T !b' !a' <- fromFailParser $ f a
return $ T (b ||| b') a'
{-# INLINE (>>=) #-}
_ >> _ = undefined ; {-# INLINE (>>) #-}
-- === Running === --
failParser :: m (Either () (T XBool a)) -> FailParser m a
failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-}
runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a))
runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser #-}
-- === MonadFailedParser === --
-- | Behaves just like "left" - lifts until it hits MonadFailedParser
class Monad m => MonadFailedParser m where
failed :: m a
instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t m))
=> MonadFailedParser (t m) where
failed = lift failed ; {-# INLINE failed #-}
instance Monad m => MonadFailedParser (FailParser m) where
failed = failParser $ return $ Left () ; {-# INLINE failed #-}
-----------------------
-- === Main loop === --
-----------------------
parserLoop :: StateT Text (FailParser Identity) Bool
parserLoop = parserStep >> parserLoop
parserStep :: StateT Text (FailParser Identity) Char
parserStep = get >>= \s -> case Text.uncons s of
Just (!t, !s') -> if t == 'a' then put s' >> return t else failed
Nothing -> failed
{-# INLINE parserStep #-}
-- === Criterion === --
instance NFData XBool
instance (NFData l, NFData r) => NFData (Either l r)
instance (NFData a, NFData b) => NFData (T a b)
genText :: Int -> Text
genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-}
a_parsing_main :: IO ()
a_parsing_main = do
defaultMain
[ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity . runFailParser . evalStateT parserLoop)
]
main = a_parsing_main
The most important part is the bind
implementation of FailParser
:
FailParser ma >>= f = FailParser $ do
T b a <- ma
T b' a' <- fromFailParser $ f a
return $ T (b ||| b') a'
There are several performance related observations and problems:
- **INFO:** Everything is compiled with
-XStrict
and every field in this code is fully evaluated, in particularb
andb'
are fully evaluated, strict values of typeXBool
. - **INFO:** Neither
b
norb'
are used anywhere else in the code. They are just fields inFailParser
which should be used to store information if we did consume a letter or we did not. - **PROBLEM:** When provided with
10^6
characters this code works in 1ms. If we replace(b ||| b')
with(b' ||| b)
or with(b')
the time do NOT change. However, if we replace it with(b)
, we've got **15 times** slowdown. Moreover, the resulting core is changed drastically in some places. - **PROBLEM:** Another interesting observation is that the value of
XBool
is created only in one place in the code, namely in:FailParser
'sMonad
implementation, inreturn
function:return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
. We never change the XFalse, so this is the only value that could appear in this code. If we change it toXTrue
in this implementation however, we again get **15 times** slowdown. - **INFO:** The order of
case
expressions in definition of(|||)
or the order of constructor defintions of any datatype does not affect the above results.