Performance Regression from GHC 8.10.7 to GHC 9.0.1 (and later)

Summary

In some situations, GHC 9.0.1, GHC 9.0.2, GHC 9.2.4, and GHC 9.4.2 all generate code that performs worse than the code generated by GHC 8.10.7.

Steps to reproduce

I have come up with the following minimal example:

import Data.Foldable
import Data.Vector
import Test.Tasty.Bench

{-# INLINE arithmean #-}
arithmean :: Foldable f => Fractional a => f a -> a
arithmean = g . Data.Foldable.foldl' f (0 :: Int, 0) where
    f (n, a) x = (n + 1, a + x)
    g (n, a) = a / fromIntegral n

main :: IO ()
main = defaultMain [bench "arithmean" $ whnf (arithmean . enumFromN (1 :: Double)) 1000000]

I run this with ghc -O2 Arithmean.hs && ./Arithmean +RTS -T. In GHC 9.0.1 (and later), this results in

arithmean: OK (0.30s)
  2.32 ms ± 178 μs,  15 MB allocated, 1.8 KB copied, 2.0 MB peak memory

Expected behavior

In GHC 8.10.7, this results in

arithmean: OK (0.22s)
  854  μs ±  51 μs,   0 B  allocated,   0 B  copied, 2.0 MB peak memory

It is much faster and performs no heap allocation.

Environment

GHC version used: 8.10.7, 9.0.1, 9.0.2, 9.2.4, 9.4.2

Optional:

Operating System: Arch Linux
System Architecture: x64

Investigation

I have looked at the core generated by different versions of GHC.

GHC 8.10.7

GHC 8.10.7 gives me the following worker and benchmark.

Rec {
-- RHS size: {terms: 31, types: 6, coercions: 0, joins: 0/0}
$s$wfoldlM'_loop
  = \ sc sc1 sc2 sc3 ->
      case ># sc 0# of {
        __DEFAULT ->
          case /## sc2 (int2Double# sc3) of wild2 { __DEFAULT -> D# wild2 };
        1# ->
          $s$wfoldlM'_loop
            (-# sc 1#) (+## sc1 1.0##) (+## sc2 sc1) (+# sc3 1#)
      }
end Rec }

Rec {
-- RHS size: {terms: 28, types: 27, coercions: 0, joins: 0/0}
main_$s$wbenchLoop1
  = \ sc sc1 sc2 ->
      case sc1 of wild {
        __DEFAULT ->
          case seq#
                 (case sc2 of { I# ww1 -> $s$wfoldlM'_loop ww1 1.0## 0.0## 0# }) sc
          of
          { (# ipv, ipv1 #) ->
          main_$s$wbenchLoop1 ipv (minusWord# wild 1##) sc2
          };
        0## -> (# sc, () #)
      }
end Rec }

The worker exclusively uses unboxed values and thus is very fast and performs no heap allocation. The benchmark simply calls the worker function.

GHC 9.0.1

GHC 9.0.1 gives me the following worker and benchmark.

Rec {
-- RHS size: {terms: 31, types: 6, coercions: 0, joins: 0/0}
$s$wfoldlM'_loop
  = \ sc sc1 sc2 sc3 ->
      case ># sc 0# of {
        __DEFAULT ->
          case /## sc2 (int2Double# sc3) of ww { __DEFAULT -> D# ww };
        1# ->
          $s$wfoldlM'_loop
            (-# sc 1#) (+## sc1 1.0##) (+## sc2 sc1) (+# sc3 1#)
      }
end Rec }

Rec {
-- RHS size: {terms: 71, types: 43, coercions: 0, joins: 1/1}
$s$wbenchLoop
  = \ sc sc1 sc2 ->
      case sc1 of wild {
        __DEFAULT ->
          case seq#
                 (case sc2 of { I# ww1 ->
                  joinrec {
                    $wfoldlM'_loop w ww2 ww3 ww4 ww5
                      = case w of { __DEFAULT ->
                        case ># ww5 0# of {
                          __DEFAULT ->
                            case /## ww3 (int2Double# ww2) of ww6 { __DEFAULT -> D# ww6 };
                          1# ->
                            case ww4 of { D# y ->
                            jump $wfoldlM'_loop
                              SPEC (+# ww2 1#) (+## ww3 y) (D# (+## y 1.0##)) (-# ww5 1#)
                            }
                        }
                        }; } in
                  jump $wfoldlM'_loop SPEC 0# 0.0## (D# 1.0##) ww1
                  })
                 sc
          of
          { (# ipv, ipv1 #) ->
          $s$wbenchLoop ipv (minusWord# wild 1##) sc2
          };
        0## -> (# sc, () #)
      }
end Rec }

The worker isomorphic up to renaming to the one generated by GHC 8.10.7. However, the benchmark does not actually use the generated worker! Indeed, the worker is only assigned to a top-level binding:

-- RHS size: {terms: 9, types: 3, coercions: 0, joins: 0/0}
eta
  = \ w ->
      case w of { I# ww1 -> $s$wfoldlM'_loop ww1 1.0## 0.0## 0# }

This top-level binding is then never used, which is very confusing to me.

Inside the benchmark function, GHC 9.0.1 generates a different version of the worker which boxes one of the parameters for no apparent reason, which explains the slowdown and the heap allocation.

I am very confused by the fact that GHC generates two different worker functions, one of which has additional boxing. I was unable to remove this boxing with strictness annotations.

The core generated by GHC 9.0.1 is also quite a bit larger than the one generated by GHC 8.10.7, although I do not know if this carries over to executable size.

GHC 9.4.2

The core generated looks very similar to the one generated by GHC 9.0.1. In particular, it also contains two different versions of the worker function, with the one that is actually used in the benchmark suffering from additional boxing.

Remarks

If other benchmarks are present in the same module, sometimes GHC 9.0.1 generates code that is as fast as the one generated by GHC 8.10.7. However, this behavior is very unpredictable and I have not been able to figure out what is going on or when it happens.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information