Lookup in type class dictionaries in tight loops harms performance

This is result of attempt to optimize vector in case when specialization doesn't happen. Consider following simple benchmark program:

import qualified Data.Vector.Generic as VG

meanNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE meanNoInline #-}
meanNoInline xs = VG.sum xs / fromIntegral (VG.length xs)

Benchmarking is done using CPU counter and here are results for GHC 8.10-9.8. Columns are allocations, CPU instructions, CPU cycles and branching instructions per array element

    GHC    ALLOC  INS/elt  CYC/elt  BR/elt
 8.10.7       48     76.0     21.9    13.2
  9.2.8       48     76.0     22.1    13.2
  9.4.8       48     76.0     22.1    13.2
  9.6.3       48     93.9     33.0    18.2
  9.8.1       48     93.9     32.0    18.2

There's something that happened between 9.4 and 9.6. Below is core for GHC 9.8:

-- GHC9.8
$wmeanNoInline
  :: forall {v :: * -> *}. Vector v Double => v Double -> Double#
$wmeanNoInline
  = \ (@(v :: * -> *))
      ($dVector :: Vector v Double)
      (xs_sfLo :: v Double) ->
      case xs_sfLo of vec { __DEFAULT ->
      case basicLength @v @Double $dVector vec of
      { I# len ->
      joinrec {
        $s$wfoldlM'_loop :: Int# -> Double# -> Double#
        $s$wfoldlM'_loop (sc_sfLM :: Int#) (sc1_sfLL :: Double#)
          = case >=# sc_sfLM len of {
              __DEFAULT ->
                case basicUnsafeIndexM @v @Double $dVector vec (I# sc_sfLM) of
                { Box x_i2TI ->
                case x_i2TI of { D# y_afku ->
                jump $s$wfoldlM'_loop (+# sc_sfLM 1#) (+## sc1_sfLL y_afku)
                }};
              1# -> /## sc1_sfLL (int2Double# len)
            }; } in
      jump $s$wfoldlM'_loop 0# 0.0##
      }}

meanNoInline
  :: forall (v :: * -> *). Vector v Double => v Double -> Double
meanNoInline
  = \ (@(v :: * -> *))
      ($dVector :: Vector v Double)
      (xs_sfLo :: v Double) ->
      case $wmeanNoInline @v $dVector xs_sfLo of ww_sfLs
      { __DEFAULT ->
      D# ww_sfLs}

Core for 9.4 is almost same. Except worker wrapper is done differently, it passes basicUnsafeIndex and basicLength instead of complete type class dictionary. Here is type signature for worker function, I hope it would be enough:

-- GHC9.4
$wmeanNoInline
  :: forall {v :: * -> *}.
     (v Double -> Int)
  -> (v Double -> Int -> Box Double)
  -> v Double -> Double#
$wmeanNoInline

It seems that performance difference boils down to GHC9.8 performing lookup of type class method on each iteration. This doesn't seem to be specific to vector at all. Any tight loop using type class method will have performance degradation. It may be advantageous to float out lookup in a type class dictionary.

Unboxed parameters

Situations becomes more complicated if unboxed data types is involved. basicUnsafeIndex takes index as Int and promptly uses it. So as an optimization one could add another indexing method which takes Int# as parameter and use it everywhere

  basicUnsafeIndexM  :: v a -> Int  -> Box a
  basicUnsafeIndexM# :: v a -> Int# -> Box a

And it's indeed an optimization for GHC<=9.4. ≈30% speedup, allocations go down from 48 (Box Double & Int) to 32 (Box Double). For GHC9.8 it's a pessimization: allocations go up 48→64 (Box Double + ???), execution time +65%. Core is almost same as above except basicUnsafeIndexM is replaced with basicUnsafeIndexM#

I'm completely at loss on what is going on. Maybe case basicUnsafeIndexM# @v @Double $dVector vec sc_sfLt of performs some extra work when compared to case basicUnsafeIndexM @v @Double $dVector vec (I# sc_sfLt) of?..

Optional:

Operating System: linux
System Architecture: x86_64

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information