CPR related performance issue
By default CRP analysis can be too aggressive in trying to pass as much as possible in unboxed tuples, in general it's not a problem but when one big datatype is passed to several consumers it might end up pushed to stack several times instead of once - to heap, things are getting worse when there are sufficient fields to cause stack overflow which otherwise is possible to avoid - in our codebase adding one field with ExistentialQuantification (unused, but that prevents ghc from doing CRP transformation) reduces number of stack overflow by a factor of 1000 and increases overall performance by 10%.
In provided example performance for both A and B should be identical and yet B is consistently faster by 3-5%
It's possible to increase this performance gap by adding more and more fields.
I was able to replicate this issue in ghc 7.8.3 and 7.10,1rc2
{-# LANGUAGE ExistentialQuantification #-}
module Blah where
import Criterion
import Criterion.Main
import Data.Typeable
data A = A ()
!Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
!Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
data B = forall rep. (Typeable rep) => B rep
!Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
!Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
a :: A
a = A () 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
b :: B
b = B () 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
{-# NOINLINE a1 #-}
a1 :: A -> Int
a1 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f1
{-# NOINLINE a2 #-}
a2 :: A -> Int
a2 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f2
{-# NOINLINE a3 #-}
a3 :: A -> Int
a3 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f3
{-# NOINLINE a4 #-}
a4 :: A -> Int
a4 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f4
{-# NOINLINE b1 #-}
b1 :: B -> Int
b1 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f1
{-# NOINLINE b2 #-}
b2 :: B -> Int
b2 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f2
{-# NOINLINE b3 #-}
b3 :: B -> Int
b3 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f3
{-# NOINLINE b4 #-}
b4 :: B -> Int
b4 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f4
{-# NOINLINE fa #-}
fa :: A -> Int
fa a = a1 a + a2 a + a3 a + a4 a
{-# NOINLINE fb #-}
fb :: B -> Int
fb b = b1 b + b2 b + b3 b + b4 b
main :: IO ()
main = defaultMain [
bgroup "single call" [
bench "A" $ whnf fa a
, bench "B" $ whnf fb b
]
]
Trac metadata
Trac field | Value |
---|---|
Version | 7.10.1-rc2 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Compiler |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture |