Inconsistent performance of simple loops with NCG
Summary
The NCG generates code for simple loops that performs very differently depending on where exactly the loop inlines.
Steps to reproduce
Compile the following code:
{-# language BangPatterns #-}
{-# OPTIONS_GHC -O2 #-}
import Criterion
import Criterion.Main
import Data.Primitive
import GHC.Exts (RealWorld)
import Control.Monad.Primitive
main :: IO ()
main = do
let sz = 100000
arr <- newPrimArray sz
defaultMain
[ bench "first" (whnfIO (incrementHighToLowCase arr))
, bench "second" (whnfIO (incrementHighToLowCase arr))
, bench "third" (whnfIO (incrementHighToLowCase arr))
, bench "fourth" (whnfIO (incrementHighToLowCase arr))
, bench "fifth" (whnfIO (incrementHighToLowCase arr))
, bench "sixth" (whnfIO (incrementHighToLowCase arr))
]
incrementHighToLowCase :: MutablePrimArray RealWorld Int -> IO ()
incrementHighToLowCase m = modify' (+1) m
modify' :: (Prim a, PrimMonad m)
=> (a -> a)
-> MutablePrimArray (PrimState m) a
-> m ()
modify' f marr = do
let go !ix = case ix of
(-1) -> return ()
_ -> do
x <- readPrimArray marr ix
let !y = f x
writePrimArray marr ix y
go (ix - 1)
!sz <- getSizeofMutablePrimArray marr
go (sz - 1)
There are three different flavors I've explored. One is to compile the code as is with the NCG (Variant A). One is to compile the code as in with -fllvm
to use the LLVM backend (Variant B). The third is to compile the code with the NCG after explicitly marking incrementHighToLowCase
as NOINLINE
(Variant C). For a ligther dependency footprint, it is possible to use gauge
instead of criterion
. The results are the same.
Expected behavior
I expect for all six benchmarks to have the same performance. Instead, I get this:
benchmarking first
time 90.90 μs (90.39 μs .. 91.53 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 90.66 μs (90.36 μs .. 91.09 μs)
std dev 1.187 μs (916.3 ns .. 1.689 μs)
benchmarking second
time 83.24 μs (82.96 μs .. 83.58 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 83.30 μs (83.10 μs .. 83.55 μs)
std dev 725.9 ns (580.6 ns .. 956.9 ns)
benchmarking third
time 56.02 μs (55.82 μs .. 56.25 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 56.15 μs (56.00 μs .. 56.37 μs)
std dev 604.7 ns (514.3 ns .. 761.8 ns)
benchmarking fourth
time 87.08 μs (86.40 μs .. 87.75 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 86.85 μs (86.38 μs .. 87.37 μs)
std dev 1.666 μs (1.395 μs .. 2.062 μs)
variance introduced by outliers: 14% (moderately inflated)
benchmarking fifth
time 94.35 μs (93.96 μs .. 94.81 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 94.28 μs (94.00 μs .. 94.65 μs)
std dev 1.001 μs (803.7 ns .. 1.299 μs)
benchmarking sixth
time 55.48 μs (55.36 μs .. 55.65 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 55.57 μs (55.46 μs .. 55.72 μs)
std dev 404.1 ns (337.9 ns .. 502.8 ns)
Strangely, benchmarks third
and sixth
are much faster than the others, but this only happens in Variant A. The other two variants show very consistent performance of the function. Variant B gives durations that hover within a percentage point of 44.47 μs, and Variant C gives durations that hover within a percentage point of 83.07 μs.
With -ddump-simple
, I've been able to confirm that (without a NOINLINE
pragma) the loop gets inlined into six different places in the benchmarking code, once into each individual benchmark. What I don't understand is why the NCG handles this so inconsistently while LLVM is fine with it.
This whole issue may not seem like a big deal, but I ran into this by encountering some benchmark results that seemed wrong. That is, the implementation of an algorithm that was supposed to be faster ended up being slower. By simply reordering the benchmarks, I was able to observe very different behavior. Suddenly, the implementation was previously slower became faster. It took me a while to realize that the NCG was compiling simple loops differently depending on which cons cell of the _ : _ : _ : _ : _ : _ : []
they showed up in. I almost made some erroneous conclusions about the implementations I was comparing, but I happened to switch around the order of the benchmarks and realized that the conclusions I had previously reached were wrong.
Environment
- GHC version used: 8.6.4
Optional:
- Operating System: Ubuntu
- System Architecture: x86-64