Inconsistent performance of simple loops with NCG

Summary

The NCG generates code for simple loops that performs very differently depending on where exactly the loop inlines.

Steps to reproduce

Compile the following code:

{-# language BangPatterns #-}
{-# OPTIONS_GHC -O2 #-}

import Criterion
import Criterion.Main
import Data.Primitive
import GHC.Exts (RealWorld)
import Control.Monad.Primitive

main :: IO ()
main = do
  let sz = 100000
  arr <- newPrimArray sz
  defaultMain
    [ bench "first" (whnfIO (incrementHighToLowCase arr))
    , bench "second" (whnfIO (incrementHighToLowCase arr))
    , bench "third" (whnfIO (incrementHighToLowCase arr))
    , bench "fourth" (whnfIO (incrementHighToLowCase arr))
    , bench "fifth" (whnfIO (incrementHighToLowCase arr))
    , bench "sixth" (whnfIO (incrementHighToLowCase arr))
    ]

incrementHighToLowCase :: MutablePrimArray RealWorld Int -> IO ()
incrementHighToLowCase m = modify' (+1) m

modify' :: (Prim a, PrimMonad m)
  => (a -> a)
  -> MutablePrimArray (PrimState m) a
  -> m ()
modify' f marr = do
  let go !ix = case ix of
        (-1) -> return ()
        _ -> do
          x <- readPrimArray marr ix
          let !y = f x
          writePrimArray marr ix y
          go (ix - 1)
  !sz <- getSizeofMutablePrimArray marr
  go (sz - 1)

There are three different flavors I've explored. One is to compile the code as is with the NCG (Variant A). One is to compile the code as in with -fllvm to use the LLVM backend (Variant B). The third is to compile the code with the NCG after explicitly marking incrementHighToLowCase as NOINLINE (Variant C). For a ligther dependency footprint, it is possible to use gauge instead of criterion. The results are the same.

Expected behavior

I expect for all six benchmarks to have the same performance. Instead, I get this:

benchmarking first                                                                                                                                               
time                 90.90 μs   (90.39 μs .. 91.53 μs)                                                                                                                       
                     1.000 R²   (0.999 R² .. 1.000 R²)                                                                                                                         
mean                 90.66 μs   (90.36 μs .. 91.09 μs)                                                                                                                  
std dev              1.187 μs   (916.3 ns .. 1.689 μs)                                                                                                               
                                                                                                                                                                        
benchmarking second                                                                                                                                                         
time                 83.24 μs   (82.96 μs .. 83.58 μs)                                                                                                             
                     1.000 R²   (1.000 R² .. 1.000 R²)                                                                                                             
mean                 83.30 μs   (83.10 μs .. 83.55 μs)                                                                                                                        
std dev              725.9 ns   (580.6 ns .. 956.9 ns)                                                                                                                            
                                                                                                                                                                                           
benchmarking third                                                                                                                                                                
time                 56.02 μs   (55.82 μs .. 56.25 μs)                                                                                                                            
                     1.000 R²   (1.000 R² .. 1.000 R²)                                                                                                                      
mean                 56.15 μs   (56.00 μs .. 56.37 μs)                                                                                                             
std dev              604.7 ns   (514.3 ns .. 761.8 ns)                                                                                                             
                                                                                                                                                                        
benchmarking fourth                                                                                                                                                
time                 87.08 μs   (86.40 μs .. 87.75 μs)                                                                                                             
                     1.000 R²   (0.999 R² .. 1.000 R²)                                                                                                                       
mean                 86.85 μs   (86.38 μs .. 87.37 μs)                                                                                                                      
std dev              1.666 μs   (1.395 μs .. 2.062 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking fifth
time                 94.35 μs   (93.96 μs .. 94.81 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 94.28 μs   (94.00 μs .. 94.65 μs)
std dev              1.001 μs   (803.7 ns .. 1.299 μs)

benchmarking sixth
time                 55.48 μs   (55.36 μs .. 55.65 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 55.57 μs   (55.46 μs .. 55.72 μs)
std dev              404.1 ns   (337.9 ns .. 502.8 ns)

Strangely, benchmarks third and sixth are much faster than the others, but this only happens in Variant A. The other two variants show very consistent performance of the function. Variant B gives durations that hover within a percentage point of 44.47 μs, and Variant C gives durations that hover within a percentage point of 83.07 μs.

With -ddump-simple, I've been able to confirm that (without a NOINLINE pragma) the loop gets inlined into six different places in the benchmarking code, once into each individual benchmark. What I don't understand is why the NCG handles this so inconsistently while LLVM is fine with it.

This whole issue may not seem like a big deal, but I ran into this by encountering some benchmark results that seemed wrong. That is, the implementation of an algorithm that was supposed to be faster ended up being slower. By simply reordering the benchmarks, I was able to observe very different behavior. Suddenly, the implementation was previously slower became faster. It took me a while to realize that the NCG was compiling simple loops differently depending on which cons cell of the _ : _ : _ : _ : _ : _ : [] they showed up in. I almost made some erroneous conclusions about the implementations I was comparing, but I happened to switch around the order of the benchmarks and realized that the conclusions I had previously reached were wrong.

Environment

GHC version used: 8.6.4

Optional:

Operating System: Ubuntu
System Architecture: x86-64

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information