Weird performance results.

Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results.

*The code is compiled with -XStrict enabled globally. The full source code for this ticket is attached, while the exposed code below uses ... to hide some non-important implementations.**

To get desired results, we use following GHC flags: -O2 -funfolding-use-threshold=10000.

Let's consider the following program. It is just a pseudo-parser implementation. It consumes 'a' chars in a loop and fails on empty input in the end:

-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- | WARNING: -XStrict enabled in this file !!!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

module Main where

imports ... (full source attached to this ticket)

------------------------
-- === Primitives === --
------------------------

-- === Strict Either === --

data    Either  e   a = Left e | Right a deriving (Eq, Generic, Ord, Read, Show, Functor)
newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) }

instance Monad m => Functor (EitherT e m) where ...
instance Monad m => Applicative (EitherT e m) where ...
instance Monad m => Monad (EitherT e m) where ...


-- === Strict Bool === --

data XBool = XTrue | XFalse deriving (Show, Generic)

(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
    XTrue  -> a
    XFalse -> b
{-# INLINE (|||) #-}


-- === Strict Tuple === --

data T a b = T !a !b deriving (Generic, Show, Functor)


------------------------
-- === FailParser === --
------------------------

-- === Definition === --
-- | It is just like EitherT, but also contains progress indicator - a field of type XBool
--   which tells us if we've already parsed a char or not yet. In this snippet code however,
--   it does not do anything valuable - it just stores the value.

newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor)

instance Monad m => Applicative (FailParser m) where
    pure  = undefined
    (<*>) = undefined

instance Monad m => Monad (FailParser m) where
    return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
    FailParser ma >>= f = FailParser $ do
        T !b  !a  <- ma
        T !b' !a' <- fromFailParser $ f a
        return $ T (b ||| b') a'
    {-# INLINE (>>=) #-}
    _ >> _ = undefined ; {-# INLINE (>>) #-}


-- === Running === --

failParser :: m (Either () (T XBool a)) -> FailParser m a
failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-}

runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a))
runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser #-}


-- === MonadFailedParser === --
-- | Behaves just like "left" - lifts until it hits MonadFailedParser

class Monad m => MonadFailedParser m where
    failed :: m a

instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t m))
      => MonadFailedParser (t m) where
    failed = lift failed ; {-# INLINE failed #-}

instance Monad m => MonadFailedParser (FailParser m) where
    failed = failParser $ return $ Left () ; {-# INLINE failed #-}


-----------------------
-- === Main loop === --
-----------------------

parserLoop :: StateT Text (FailParser Identity) Bool
parserLoop = parserStep >> parserLoop

parserStep :: StateT Text (FailParser Identity) Char
parserStep = get >>= \s -> case Text.uncons s of
    Just (!t, !s') -> if t == 'a' then put s' >> return t else failed
    Nothing        -> failed
{-# INLINE parserStep #-}


-- === Criterion === --

instance NFData XBool
instance (NFData l, NFData r) => NFData (Either l r)
instance (NFData a, NFData b) => NFData (T a b)

genText :: Int -> Text
genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-}

a_parsing_main :: IO ()
a_parsing_main = do
    defaultMain
        [ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity . runFailParser . evalStateT parserLoop)
        ]


main = a_parsing_main

The most important part is the bind implementation of FailParser:

FailParser ma >>= f = FailParser $ do
    T b  a  <- ma
    T b' a' <- fromFailParser $ f a
    return $ T (b ||| b') a'

There are several performance related observations and problems:

**INFO:** Everything is compiled with -XStrict and every field in this code is fully evaluated, in particular b and b' are fully evaluated, strict values of type XBool.
**INFO:** Neither b nor b' are used anywhere else in the code. They are just fields in FailParser which should be used to store information if we did consume a letter or we did not.
**PROBLEM:** When provided with 10^6 characters this code works in 1ms. If we replace (b ||| b') with (b' ||| b) or with (b') the time do NOT change. However, if we replace it with (b), we've got **15 times** slowdown. Moreover, the resulting core is changed drastically in some places.
**PROBLEM:** Another interesting observation is that the value of XBool is created only in one place in the code, namely in: FailParser's Monad implementation, in return function: return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}. We never change the XFalse, so this is the only value that could appear in this code. If we change it to XTrue in this implementation however, we again get **15 times** slowdown.
**INFO:** The order of case expressions in definition of (|||) or the order of constructor defintions of any datatype does not affect the above results.

Edited Mar 10, 2019 by danilo2

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

Weird performance results.