Lookup in type class dictionaries in tight loops harms performance
This is result of attempt to optimize vector in case when specialization doesn't happen. Consider following simple benchmark program:
import qualified Data.Vector.Generic as VG
meanNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE meanNoInline #-}
meanNoInline xs = VG.sum xs / fromIntegral (VG.length xs)
Benchmarking is done using CPU counter and here are results for GHC 8.10-9.8. Columns are allocations, CPU instructions, CPU cycles and branching instructions per array element
GHC ALLOC INS/elt CYC/elt BR/elt
8.10.7 48 76.0 21.9 13.2
9.2.8 48 76.0 22.1 13.2
9.4.8 48 76.0 22.1 13.2
9.6.3 48 93.9 33.0 18.2
9.8.1 48 93.9 32.0 18.2
There's something that happened between 9.4 and 9.6. Below is core for GHC 9.8:
-- GHC9.8
$wmeanNoInline
:: forall {v :: * -> *}. Vector v Double => v Double -> Double#
$wmeanNoInline
= \ (@(v :: * -> *))
($dVector :: Vector v Double)
(xs_sfLo :: v Double) ->
case xs_sfLo of vec { __DEFAULT ->
case basicLength @v @Double $dVector vec of
{ I# len ->
joinrec {
$s$wfoldlM'_loop :: Int# -> Double# -> Double#
$s$wfoldlM'_loop (sc_sfLM :: Int#) (sc1_sfLL :: Double#)
= case >=# sc_sfLM len of {
__DEFAULT ->
case basicUnsafeIndexM @v @Double $dVector vec (I# sc_sfLM) of
{ Box x_i2TI ->
case x_i2TI of { D# y_afku ->
jump $s$wfoldlM'_loop (+# sc_sfLM 1#) (+## sc1_sfLL y_afku)
}};
1# -> /## sc1_sfLL (int2Double# len)
}; } in
jump $s$wfoldlM'_loop 0# 0.0##
}}
meanNoInline
:: forall (v :: * -> *). Vector v Double => v Double -> Double
meanNoInline
= \ (@(v :: * -> *))
($dVector :: Vector v Double)
(xs_sfLo :: v Double) ->
case $wmeanNoInline @v $dVector xs_sfLo of ww_sfLs
{ __DEFAULT ->
D# ww_sfLs}
Core for 9.4 is almost same. Except worker wrapper is done differently, it passes basicUnsafeIndex
and basicLength
instead of complete type class dictionary. Here is type signature for worker function, I hope it would be enough:
-- GHC9.4
$wmeanNoInline
:: forall {v :: * -> *}.
(v Double -> Int)
-> (v Double -> Int -> Box Double)
-> v Double -> Double#
$wmeanNoInline
It seems that performance difference boils down to GHC9.8 performing lookup of type class method on each iteration. This doesn't seem to be specific to vector at all. Any tight loop using type class method will have performance degradation. It may be advantageous to float out lookup in a type class dictionary.
Unboxed parameters
Situations becomes more complicated if unboxed data types is involved. basicUnsafeIndex
takes index as Int
and promptly uses it. So as an optimization one could add another indexing method which takes Int#
as parameter and use it everywhere
basicUnsafeIndexM :: v a -> Int -> Box a
basicUnsafeIndexM# :: v a -> Int# -> Box a
And it's indeed an optimization for GHC<=9.4. ≈30% speedup, allocations go down from 48 (Box Double & Int) to 32 (Box Double). For GHC9.8 it's a pessimization: allocations go up 48→64 (Box Double + ???), execution time +65%. Core is almost same as above except basicUnsafeIndexM
is replaced with basicUnsafeIndexM#
I'm completely at loss on what is going on. Maybe case basicUnsafeIndexM# @v @Double $dVector vec sc_sfLt of
performs some extra work when compared to case basicUnsafeIndexM @v @Double $dVector vec (I# sc_sfLt) of
?..
Optional:
- Operating System: linux
- System Architecture: x86_64