Performance regression from 8.8.4 to 8.10.2

Summary

I wrote a small cellular automaton as a stencil array computation using the massiv library; the code is here.

The core of the logic is:

s1 :: Massiv.Array U Ix2 Position -> Int
s1 = countOccupied . (updateUntilEq update)

countOccupied ::
  Massiv.Array U Ix2 Position -> Int
countOccupied =
  Massiv.foldlS
  (\acc pos -> acc + (fromEnum (pos == Occupied)))
  0

updateUntilEq
  :: (Massiv.Array U Ix2 Position -> Massiv.Array U Ix2 Position)
  -> Massiv.Array U Ix2 Position
  -> Massiv.Array U Ix2 Position
updateUntilEq upFun =
  Massiv.iterateUntil
  (const (==))
  (const upFun)

update :: Massiv.Array U Ix2 Position -> Massiv.Array U Ix2 Position
update =
  Massiv.compute . Massiv.mapStencil border updateStencil

border :: Massiv.Border Position
border = Massiv.Fill Floor

updateStencil :: Massiv.Stencil Ix2 Position Position
updateStencil =
  Massiv.makeStencilDef Floor (Sz (3 :. 3)) (1 :. 1) $ \ get ->
    let
      chair =  get (0 :. 0)
      isOcc c = fromEnum . (== Occupied) <$> (get c)
      occ :: Massiv.Value Int
      !occ =
        (  isOcc (-1 :. -1) + isOcc (-1 :. 0) + isOcc (-1 :. 1) +
           isOcc ( 0 :. -1)                   + isOcc ( 0 :. 1) +
           isOcc ( 1 :. -1) + isOcc ( 1 :. 0) + isOcc ( 1 :. 1)
        )
    in
      if (unsafeCoerce chair) == Floor
        then pure Floor
        else liftA2 threshold chair occ
{-# inline updateStencil #-}

threshold :: Position -> Int -> Position
threshold !p !n | n == 0    = Occupied
                | n >= 4    = Empty
                | otherwise = p

{-# inline threshold #-}
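To check that both compiler versions compute the same answer, it can help to have a version of the rule that does not depend on massiv. The following is only a minimal sketch on plain lists, not the benchmarked code. The `Position` constructors and the `threshold` rule are taken from the snippet above; `Grid`, `cellAt`, `step` and `fixedPoint` are hypothetical names introduced for illustration. It assumes that `iterateUntil (const (==))` stops as soon as two successive arrays are equal.

```haskell
module Main where

-- Stand-in for the Position type used above (constructor names from the report).
data Position = Floor | Empty | Occupied
  deriving (Eq, Show)

-- Hypothetical list-of-rows representation; the real code uses an unboxed massiv array.
type Grid = [[Position]]

-- Same rule as `threshold` above, without the strictness annotations.
threshold :: Position -> Int -> Position
threshold p n
  | n == 0    = Occupied
  | n >= 4    = Empty
  | otherwise = p

-- Look up a cell; anything outside the grid counts as Floor,
-- mirroring the `Massiv.Fill Floor` border.
cellAt :: Grid -> Int -> Int -> Position
cellAt g r c
  | r < 0 || c < 0 = Floor
  | otherwise =
      case drop r g of
        (row : _) ->
          case drop c row of
            (p : _) -> p
            []      -> Floor
        [] -> Floor

-- One generation: floor stays floor; every other cell is updated from the
-- number of occupied cells among its eight neighbours.
step :: Grid -> Grid
step g =
  [ [ update r c p | (c, p) <- zip [0 ..] row ]
  | (r, row) <- zip [0 ..] g
  ]
  where
    update r c p
      | p == Floor = Floor
      | otherwise  = threshold p (occupiedAround r c)
    occupiedAround r c =
      length
        [ ()
        | dr <- [-1, 0, 1]
        , dc <- [-1, 0, 1]
        , (dr, dc) /= (0, 0)
        , cellAt g (r + dr) (c + dc) == Occupied
        ]

-- Apply `step` until the grid no longer changes.
fixedPoint :: Grid -> Grid
fixedPoint g
  | g' == g   = g
  | otherwise = fixedPoint g'
  where
    g' = step g

countOccupied :: Grid -> Int
countOccupied = length . filter (== Occupied) . concat

main :: IO ()
main = do
  -- A 3x3 grid of empty seats settles with only the four corners occupied.
  let grid = replicate 3 (replicate 3 Empty)
  print (countOccupied (fixedPoint grid)) -- prints 4
```

This version is far slower than the stencil code and is only meant as a correctness reference on small inputs.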

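This has the following performance across 8.8.4 and 8.10.2 (as measured with criterion). By mean time, 8.10.2 is about 6.4x slower with the LLVM backend and about 3.0x slower with the NCG: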

ghc-8.8.4: llvm

benchmarking massiv-example/run
time                 11.48 ms   (11.40 ms .. 11.56 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 11.57 ms   (11.52 ms .. 11.63 ms)
std dev              140.5 μs   (98.34 μs .. 200.3 μs)

ghc-8.8.4: ncg

benchmarking massiv-example/run
time                 27.48 ms   (26.73 ms .. 28.18 ms)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 26.90 ms   (26.69 ms .. 27.19 ms)
std dev              507.9 μs   (358.3 μs .. 715.2 μs)

ghc-8.10.2: llvm

benchmarking massiv-example/run
time                 77.22 ms   (74.31 ms .. 79.26 ms)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 74.52 ms   (73.65 ms .. 75.69 ms)
std dev              1.824 ms   (1.269 ms .. 2.846 ms)

ghc-8.10.2: ncg

benchmarking massiv-example/run
time                 82.77 ms   (79.52 ms .. 85.60 ms)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 81.42 ms   (80.50 ms .. 82.61 ms)
std dev              1.723 ms   (1.321 ms .. 2.185 ms)

[It is also striking how much better LLVM does than the NCG on 8.8.4.]

The repo also includes the Core output for each version here. I have not tried running this code with 9.0, as that proved beyond my abilities with cabal.

Steps to reproduce

Clone the repo, then run:

cabal configure --with-compiler=ghc-8.8.4   # or ghc-8.10.2
cabal build
cabal run bench

Expected behavior

Performance should be roughly similar for 8.8.4 and 8.10.2.
