Performance Regression in splitmix from GHC 9.2.8 to GHC 9.4.1 (and later)
Summary
When using the nextDouble
function from the splitmix package, GHC 9.4.1, GHC 9.4.7, and GHC 9.6.2 all generate code that performs worse than the code generated by GHC 8.10.7, GHC 9.0.2, and GHC 9.2.8.
Steps to reproduce
I have come up with the following minimal example:
{-# LANGUAGE BangPatterns #-}
import System.Random.SplitMix
import Test.Tasty.Bench
{-# NOINLINE loop #-}
loop :: Int -> Double -> SMGen -> (Double, SMGen)
loop 0 !a !s = (a, s)
loop n !a !s = loop (n - 1) (a + b) t where (b, t) = nextDouble s
main :: IO ()
main = defaultMain [bench "main" $ whnf (fst . loop 1000000 0) (mkSMGen 0)]
When compiled on various GHC versions using -O2
and run using +RTS -T
, this results in:
GHC 8.10.7: 2.78 ms ± 105 μs, 0 B allocated, 0 B copied, 2.0 MB peak memory
GHC 9.0.2: 2.83 ms ± 195 μs, 0 B allocated, 0 B copied, 2.0 MB peak memory
GHC 9.2.8: 2.77 ms ± 229 μs, 0 B allocated, 0 B copied, 6.0 MB peak memory
GHC 9.4.1: 4.62 ms ± 345 μs, 15 MB allocated, 559 B copied, 6.0 MB peak memory
GHC 9.4.7: 4.93 ms ± 427 μs, 15 MB allocated, 559 B copied, 6.0 MB peak memory
GHC 9.6.2: 5.24 ms ± 270 μs, 15 MB allocated, 477 B copied, 6.0 MB peak memory
Expected behavior
I expect this code to run without allocating any heap memory.
Investigation
I have looked at the core generated by different versions of GHC.
GHC 9.2.8
Rec {
-- RHS size: {terms: 52, types: 12, coercions: 0, joins: 0/3}
$wloop
= \ ww ww1 ww2 ww3 ->
case ww of ds {
__DEFAULT ->
let { seed' = plusWord# ww2 ww3 } in
let {
x#
= timesWord#
(xor# seed' (uncheckedShiftRL# seed' 33#))
18397679294719823053## } in
let {
x#1
= timesWord#
(xor# x# (uncheckedShiftRL# x# 33#)) 14181476777654086739## } in
$wloop
(-# ds 1#)
(+##
ww1
(*##
(word2Double#
(uncheckedShiftRL# (xor# x#1 (uncheckedShiftRL# x#1 33#)) 11#))
1.1102230246251565e-16##))
seed'
ww3;
0# -> (# D# ww1, SMGen ww2 ww3 #)
}
end Rec }
GHC 9.4.1
Rec {
$wloop
= \ ww ww1 ww2 ww3 ->
case ww of ds {
__DEFAULT ->
let { seed' = plusWord64# ww2 ww3 } in
let {
x#
= timesWord64#
(xor64# seed' (uncheckedShiftRL64# seed' 33#))
18397679294719823053##64 } in
let {
x#1
= timesWord64#
(xor64# x# (uncheckedShiftRL64# x# 33#))
14181476777654086739##64 } in
case integerToDouble#
(integerFromWord64#
(uncheckedShiftRL64#
(xor64# x#1 (uncheckedShiftRL64# x#1 33#)) 11#))
of wild1
{ __DEFAULT ->
$wloop
(-# ds 1#) (+## ww1 (*## wild1 1.1102230246251565e-16##)) seed' ww3
};
0# -> (# ww1, ww2, ww3 #)
}
end Rec }
Analysis
In splitmix, nextDouble
is defined as follows:
nextDouble :: SMGen -> (Double, SMGen)
nextDouble g = case nextWord64 g of
(w64, g') -> (fromIntegral (w64 `shiftR` 11) * doubleUlp, g')
It looks to me like the main difference is that 9.2.8 translates fromIntegral
to word2Double#
, while 9.4.1 translates it to integerToDouble# . integerFromWord64#
. I suspect that this is where the 16 bytes of allocated heap memory per loop iteration come from.
Cores from GHC versions earlier than 9.2.8 are fairly unreadable but also use word2Double#
, while cores from versions later than 9.4.1 look almost identical to the one from 9.4.1.