Unexplained performance boost with +RTS -h
In #5505 (closed) the following program was reported:
```haskell
{-# LANGUAGE
    RankNTypes
  , RecursiveDo
  , TypeFamilies
  , GeneralizedNewtypeDeriving
  #-}
import System.TimeIt
import qualified Data.Vector as V
import Control.Monad.State.Strict
import Control.Monad.ST.Strict
import Data.STRef

-- The STO monad is just StateT with an Int stacked on top of the ST monad.
newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
    deriving (Functor, Monad, MonadFix)

runSTO :: (forall s. STO s a) -> a
runSTO x = runST (evalStateT (unSTO x) 0)

data CircList s = ConsM { value :: {-# UNPACK #-} !Int
                        , cons  :: {-# UNPACK #-} !(STRef s (CircList s)) }

-- | Defines a circular list of length ns
clist ns = do
    rec allItems <-
          let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
          in V.generateM ns next
    return $ (V.!) allItems 0
  where nextCons v n = do
          n' <- switchSTorSTO $ newSTRef n
          return $ ConsM v n'

-- | Take one step in the CircList
oneStep (ConsM v c) = switchSTorSTO $ readSTRef c

-- | I tie a circular list of size 1001 and step through it n times.
main :: IO ()
main = do
    let n = 333222111 :: Int
    timeIt . print $ runSTO (clist 1001 >>= liftM value . iterateM n oneStep)
    --timeIt . print $ runST (clist 1001 >>= liftM value . iterateM n oneStep)

-- **** TO SWITCH TO ST: switch between the two sets of definitions above and below this line ****
--switchSTorSTO = id
switchSTorSTO = STO . lift

iterateM :: Monad m => Int -> (b -> m b) -> b -> m b
iterateM n f c = go n c
  where go 0  b = return b
        go ns b = f b >>= go (ns-1)
```
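The kernel being benchmarked boils down to chasing `STRef`s around a ring of nodes. For reference, here is a standalone sketch of that traversal in plain `ST` (the `Ring`, `mkRing`, `step`, and `walk` names are mine, not from the ticket; the ticket's program ties the knot with `RecursiveDo` and `Data.Vector` instead of the back-patching used here):

```haskell
import Control.Monad.ST
import Data.STRef

-- A node in a circular linked list: a payload and a mutable link to the next node.
data Ring s = Node !Int !(STRef s (Ring s))

-- Tie a ring holding the values [0 .. n-1].
mkRing :: Int -> ST s (Ring s)
mkRing n = do
    -- Allocate one link cell per node, then back-patch each cell so that
    -- node i points at node ((i+1) `mod` n).
    refs <- mapM (const (newSTRef (error "link not tied yet"))) [1 .. n]
    let nodes = zipWith Node [0 ..] refs
    sequence_ [ writeSTRef r nxt | (r, nxt) <- zip refs (tail nodes ++ [head nodes]) ]
    return (head nodes)

-- One step: follow the link, as oneStep does in the ticket.
step :: Ring s -> ST s (Ring s)
step (Node _ r) = readSTRef r

-- Build a ring of size n, take k steps from node 0, and read the value landed on.
walk :: Int -> Int -> Int
walk n k = runST $ do
    start <- mkRing n
    Node v _ <- go k start
    return v
  where
    go 0 nd = return nd
    go m nd = step nd >>= go (m - 1)

main :: IO ()
main = print (walk 5 7)  -- after 7 steps on a ring of 5, we land on 7 `mod` 5 = 2
```

With `walk 1001 333222111` this reproduces the ticket's printed answer, since 333222111 \`mod\` 1001 = 222; the benchmark is essentially this pointer chase executed 333 million times.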
On my system, when compiled for profiling, this program runs more than twice as fast with `+RTS -h`. Here is some `perf stat` output for both versions:
```
> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505
222
CPU time: 5.34s

 Performance counter stats for './5505':

    10,676,916,633 cycles                   #    0.000 GHz                     [75.00%]
    11,023,609,154 instructions             #    1.03  insns per cycle
                                            #    0.60  stalled cycles per insn [75.01%]
       462,115,593 L1-dcache-load-misses                                       [75.01%]
     6,661,087,690 stalled-cycles-backend   #   62.39% backend cycles idle     [75.01%]

       5.346786849 seconds time elapsed
```
and for the fast version:
```
> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 +RTS -h
222
CPU time: 2.38s

 Performance counter stats for './5505 +RTS -h':

     4,766,694,528 cycles                   #    0.000 GHz                     [75.02%]
    10,691,687,462 instructions             #    2.24  insns per cycle
                                            #    0.06  stalled cycles per insn [75.02%]
        18,763,412 L1-dcache-load-misses                                       [75.02%]
       673,890,507 stalled-cycles-backend   #   14.14% backend cycles idle     [74.99%]

       2.399953273 seconds time elapsed
```
There's a huge difference in the number of L1-dcache misses, and the slow version's backend is stalled for 62% of its cycles.

Why is this? I'm not exactly sure. The program doesn't allocate anything except at the beginning. The effect of `-h` is to cause a major GC to happen every 0.1s or so, which copies some data, and that should tend to cause more cache misses, not fewer.
I don't know what's going on, so I'm leaving the details in this ticket for now.
Trac metadata

| Trac field | Value |
| --- | --- |
| Version | 7.6.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |