
Unexplained performance boost with +RTS -h

In #5505 (closed) the following program was reported:

{-# LANGUAGE 
        RankNTypes
        ,RecursiveDo
        ,TypeFamilies
        ,GeneralizedNewtypeDeriving
  #-}

import System.TimeIt
import qualified Data.Vector as V
import Control.Monad.State.Strict
import Control.Monad.ST.Strict
import Data.STRef

-- The STO monad is just StateT Int stacked on top of the ST monad.
newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
    deriving (Functor, Monad, MonadFix)

runSTO :: (forall s. STO s a) -> a
runSTO x = runST (evalStateT (unSTO x) 0)

data CircList s = ConsM {value :: {-# UNPACK #-} !Int
                        ,cons  :: {-# UNPACK #-} !(STRef s (CircList s )) }

-- | Defines a circular list of length ns
clist ns = do
    rec allItems <-
            let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
            in V.generateM ns next
    return $ (V.!) allItems 0
    where nextCons v n = do
            n' <- switchSTorSTO (newSTRef n)
            return $ ConsM v n'

-- | Take one step in the CircList
oneStep (ConsM v c) = switchSTorSTO $ readSTRef c

-- | I tie a circular list of size 1001 and step through it n times.
main :: IO ()
main = do 
    let n = 333222111 :: Int
    timeIt . print $ runSTO ( clist 1001 >>= liftM value . iterateM n oneStep)
    --timeIt . print $ runST ( clist 1001 >>= liftM value . iterateM n oneStep)
--  ****************************  TO SWITCH TO ST: switch between the two sets of definitions above and below this line
--switchSTorSTO = id
switchSTorSTO = STO . lift

iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b
iterateM n f c = go n c where 
    go 0 b = return b
    go ns b = f b >>= go (ns-1)
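The knot-tying in clist can be distilled into a minimal, self-contained sketch in plain ST, with no vector or System.TimeIt dependency (the Node type and the walk function below are illustrative names, not from the ticket's code):

```haskell
{-# LANGUAGE RecursiveDo #-}
-- A trimmed-down sketch of the same technique: mdo (MonadFix) lets each
-- STRef be created pointing at a node that is only defined later,
-- tying a two-element circular list. Assumes only base.
import Control.Monad.ST
import Data.STRef

data Node s = Node { val :: !Int, next :: !(STRef s (Node s)) }

-- | Tie a two-element circular list and walk it k steps,
-- returning the value at the node we land on (k `mod` 2 here).
walk :: Int -> Int
walk k = runST $ mdo
    r0 <- newSTRef n1          -- refers to n1, defined below
    r1 <- newSTRef n0          -- refers back to n0: the knot
    let n0 = Node 0 r0
        n1 = Node 1 r1
    go k n0
  where
    go :: Int -> Node s -> ST s Int
    go 0 (Node v _) = return v
    go n (Node _ r) = readSTRef r >>= go (n - 1)
```

This works because newSTRef never forces its argument, so the nodes can be thunks when the refs are created; they are only demanded once go starts pattern-matching, by which time the recursive let has tied everything together.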

On my system, when compiled for profiling, this program runs more than twice as fast with +RTS -h. The following is some output from perf stat for both versions:

> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505        
222
CPU time:   5.34s

 Performance counter stats for './5505':

    10,676,916,633 cycles                    #    0.000 GHz                     [75.00%]
    11,023,609,154 instructions              #    1.03  insns per cycle        
                                             #    0.60  stalled cycles per insn [75.01%]
       462,115,593 L1-dcache-load-misses                                        [75.01%]
     6,661,087,690 stalled-cycles-backend    #   62.39% backend  cycles idle    [75.01%]

       5.346786849 seconds time elapsed

and for the fast version:

> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 +RTS -h
222
CPU time:   2.38s

 Performance counter stats for './5505 +RTS -h':

     4,766,694,528 cycles                    #    0.000 GHz                     [75.02%]
    10,691,687,462 instructions              #    2.24  insns per cycle        
                                             #    0.06  stalled cycles per insn [75.02%]
        18,763,412 L1-dcache-load-misses                                        [75.02%]
       673,890,507 stalled-cycles-backend    #   14.14% backend  cycles idle    [74.99%]

       2.399953273 seconds time elapsed

There's a huge difference in the number of L1-dcache load misses (roughly 25x fewer in the fast version), and the slow version's backend is stalled for 62% of its cycles.

Why is this? I'm not exactly sure. The program allocates nothing after the initial setup. The effect of -h is to trigger a major GC every 0.1s or so, which copies some data and should therefore tend to cause more cache misses, not fewer.
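One way to probe this hypothesis without the profiling runtime would be to mimic -h's periodic heap census by forcing a major collection every 0.1s from a background thread, and then check whether the speedup reappears. This is a sketch, not from the ticket; periodicMajorGC is a hypothetical helper name:

```haskell
-- Fork a thread that forces a major GC every 0.1s, approximating the
-- collections that +RTS -h performs at each heap census.
-- performGC from System.Mem runs a major (full) collection.
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import System.Mem (performGC)

periodicMajorGC :: IO ()
periodicMajorGC = do
    _ <- forkIO . forever $ do
        threadDelay 100000   -- 0.1 s
        performGC
    return ()
```

If forced major GCs alone reproduce the speedup, that would point at the collections (e.g. their effect on heap layout or cache residency) rather than at anything else the profiling RTS does.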

I don't know what's going on, so I'm leaving the details in this ticket for now.

Trac metadata

Version: 7.6.1
Type: Bug
TypeOfFailure: OtherFailure
Priority: normal
Resolution: Unresolved
Component: Compiler