Code generated by GHC 8.10.1 is slower than code generated by GHC 8.6.5 (or 8.8.3)

Yikes! It would be good to know what the cause is here.

I will take a quick look.

It seems we allocate about 15% more with much higher residency to boot.

$ winpty cabal new-run phash-bench -w /e/ghc-8.10.1/bin/ghc.exe -- -n100 +RTS -s Warning: Unknown/unsupported 'ghc' version detected (Cabal 3.1.0.0 supports 'ghc' version < 8.10): E:/ghc-8.10.1/bin/ghc.exe is version 8.10.1 Resolving dependencies... Up to date benchmarking fileHash/cat.png benchmarking fileHash/frog.jpeg benchmarking fileHash/frog.png 19,486,559,856 bytes allocated in the heap 18,837,168 bytes copied during GC 3,863,392 bytes maximum residency (301 sample(s)) 507,040 bytes maximum slop 26 MiB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen 0 17311 colls, 0 par 0.172s 0.097s 0.0000s 0.0001s Gen 1 301 colls, 0 par 0.094s 0.116s 0.0004s 0.0015s

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.000s ( 0.001s elapsed) MUT time 6.594s ( 6.769s elapsed) GC time 0.266s ( 0.213s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 6.859s ( 6.983s elapsed)

Alloc rate 2,955,307,655 bytes per MUT second

Productivity 96.1% of total user, 96.9% of total elapsed

Andi@Horzube MINGW64 /e/tmp/phash $ winpty cabal new-run phash-bench -w /e/ghc-8.8.1/bin/ghc.exe -- -n100 +RTS -s Resolving dependencies... Up to date benchmarking fileHash/cat.png benchmarking fileHash/frog.jpeg benchmarking fileHash/frog.png 16,839,850,160 bytes allocated in the heap 272,864 bytes copied during GC 3,863,376 bytes maximum residency (301 sample(s)) 507,056 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen 0 14835 colls, 0 par 0.141s 0.086s 0.0000s 0.0001s Gen 1 301 colls, 0 par 0.156s 0.119s 0.0004s 0.0018s

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.000s ( 0.001s elapsed) MUT time 5.344s ( 5.713s elapsed) GC time 0.297s ( 0.205s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 5.641s ( 5.918s elapsed)

Alloc rate 3,151,316,989 bytes per MUT second

Productivity 94.7% of total user, 96.5% of total elapsed

This likely rules out recent codegen changes for a start.

A little less then half of the increase in allocations seems to originate from readImage in the hip library.

But the module itself has a large number of code changes with the module at the core of phash resulting in 21k/25k lines of core in 8.8/8.10.

I can reduce the hip part of the regression to this:

{-# NOINLINE load #-}
load what = do
    !thing <- readImage what :: IO (Either String (Image RSU Y Double))
    Right !img <- pure thing
    writeImage "tmp.png" (img)
    return img

Sadly this still expands to 16k lines of Core which no obvious culprint. The code for 8.10 is generally smaller than 8.8 in the regressing cases. So I suspect we fail to specialize or inline things at one place or another. But its hard to say where from the volume of core generated alone.

So in the example above the regression actually stems from writeImage which isn't used in the original reproducer, but might have the same root case.

In particular we call it like this:

writeImage $s$fArrayVScse $s$fArrayRSUcse $s$fWritableImageOutputFormat load3

The specizalized class dictionaries we pass here are similar but not the same between 8.10 and 8.8.


-- 8.8
$s$fArrayVScse :: Array VS Y Double
$s$fArrayVScse
  = C:Array
      ($s$fArrayVScse_$s$fVectorVectora `cast` <Co:6>)
      ($s$fArrayVScse_$s$fMArrayVScse `cast` <Co:5>)
      $s$fArrayVScse_$s$fBaseArrayVScse
      $s$fArrayVScse_$s$fArrayVScse_$cmakeImage
      $s$fArrayVScse_$s$fArrayVScse_$cmakeImageWindowed
      lvl198_rOAR
      $s$fArrayVScse_$s$fArrayVScse_$cindex00
      $s$fArrayVScse_$s$fArrayVScse_$cmap
      $s$fArrayVScse_$s$fArrayVScse_$cimap
      $s$fArrayVScse_$s$fArrayVScse_$czipWith
      $s$fArrayVScse_$s$fArrayVScse_$cizipWith
      $s$fArrayVScse_$s$fArrayVScse_$ctraverse
      $s$fArrayVScse_$s$fArrayVScse_$ctraverse2
      $s$fArrayVScse_$s$fArrayVScse_$ctranspose
      $s$fArrayVScse_$s$fArrayVScse_$cbackpermute
      ($sfromListsVG `cast` <Co:15>)
      $s$fArrayVScse_$s$fArrayVScse_$c|*|
      $s$fArrayVScse_$s$fArrayVScse_$cfold
      $s$fArrayVScse_$s$fArrayVScse_$cfoldIx
      $s$fArrayVScse_$s$fArrayVScse_$ceq
      $s$fArrayVScse_$s$fArrayVScse_$ccompute
      (id `cast` <Co:10>)
      $s$fArrayVScse_$s$fArrayVScse_$ctoVector
      $s$fArrayVScse_$s$fArrayVScse_$cfromVector

-- 8.10
-- RHS size: {terms: 25, types: 7, coercions: 49, joins: 0/0}
$s$fArrayVScse :: Array VS Y Double
$s$fArrayVScse
  = C:Array
      ($s$fArrayVScse_$s$fVectorVectora `cast` <Co:6>)
      ($s$fArrayVScse_$s$fMArrayVScse `cast` <Co:5>)
      $s$fArrayVScse_$s$fBaseArrayVScse
      $s$fArrayVScse_$s$fArrayVScse_$cmakeImage
      $s$fArrayVScse_$s$fArrayVScse_$cmakeImageWindowed
      ($s$fArrayVScse_$cscalar1_rsO2 `cast` <Co:13>)
      $s$fArrayVScse_$s$fArrayVScse_$cindex00
      lvl215_rsOv
      lvl216_rsOw
      lvl217_rsOx
      lvl218_rsOy
      lvl219_rsOz
      lvl220_rsOA
      $s$fArrayVScse_$s$fArrayVScse_$ctranspose
      $s$fArrayVScse_$s$fArrayVScse_$cbackpermute
      ($sfromListsVG `cast` <Co:15>)
      $s$fArrayVScse_$s$fArrayVScse_$c|*|
      $s$fArrayVScse_$s$fArrayVScse_$cfold
      $s$fArrayVScse_$s$fArrayVScse_$cfoldIx
      $s$fArrayVScse_$s$fArrayVScse_$ceq
      $s$fArrayVScse_$s$fArrayVScse_$ccompute
      (id `cast` <Co:10>)
      $s$fArrayVScse_$s$fArrayVScse_$ctoVector
      $s$fArrayVScse_$s$fArrayVScse_$cfromVector

In 8.8 we get for the map method this fairly specialized method:

-- RHS size: {terms: 133, types: 316, coercions: 38, joins: 0/4}
$s$fArrayVScse_$s$fArrayVScse_$cmap
  :: forall cs' e'.
     Array VS cs' e' =>
     (Pixel cs' e' -> Pixel Y Double)
     -> Image VS cs' e' -> Image VS Y Double
$s$fArrayVScse_$s$fArrayVScse_$cmap
$s$fArrayVScse_$s$fArrayVScse_$cmap
  = \ (@ cs'_ajpv)
      (@ e'_ajpw)
      ($dArray_ajpx :: Array VS cs'_ajpv e'_ajpw)
      (eta_B2 :: Pixel cs'_ajpv e'_ajpw -> Pixel Y Double)
      (eta1_B1 :: Image VS cs'_ajpv e'_ajpw) ->
      case eta1_B1 `cast` <Co:7> of { VGImage dt_ajpE dt1_ajpF v1_ajpG ->
      case $p1Array $dArray_ajpx of
      { C:Vector ww1_sDn2 ww2_sDn3 ww3_sDn4 ww4_sDn5 ww5_sDn6 ww6_sDn7
                 ww7_sDn8 ww8_sDn9 ->
      case ww4_sDn5 (v1_ajpG `cast` <Co:6>) of { I# ipv_aipC ->
      case runRW#
        -- actual code here likely irrelevant

We do have to unpack $dArray_ajpx once but that seems not so bad.

In 8.10 we get instead:

-- RHS size: {terms: 3, types: 2, coercions: 0, joins: 0/0}
lvl215_rsOv
  :: forall cs' e'.
     Array VS cs' e' =>
     (Pixel cs' e' -> Pixel Y Double)
     -> Image VS cs' e' -> Image VS Y Double
lvl215_rsOv
  = $fArrayVScse_$cmap
      $s$fArrayVScse_$s$fMArrayVScse $s$fArrayVScse_$s$fBaseArrayVScse

Which constructs a map function by passing dictionaries to a less specialized version of map ($fArrayVScse_$cmap). The dictionaries we pass show the same pattern again and the pattern repeats for a few levels of nesting.

In the end (for the read >>= write case) I looked at we "only" allocate an additional 6 words per pixel. I strongly suspect the stronger specialization allows ww to happen somewhere where it doesn't with 8.10.

I did not track down where exactly this happens.

@sgraf812 I know you looked into things related to specialization, do you have any insights on changes which might have caused this change in behaviour?

I bundled the read/write regression repro in this repo: https://github.com/AndreasPK/hip-bench.git for whoever wants to dig deeper. I lack the experience with the specializer to dig deeper at this time.

@lexi.lambda Alexis King is deep into the specialser right now. Alexis, might you be able to look at this? The perf regression is quite bad.

added specialisation label

I have taken a brief look. I haven’t worked out all the details yet, but I have found a bit of a smoking gun: on GHC 8.10, basicUnsafeNew fails to specialize where GHC 8.8 specializes it. What’s so interesting about basicUnsafeNew? It has the following definition:

class MVector v a where
  ...
  basicUnsafeNew :: PrimMonad m => Int -> m (v (PrimState m) a)
  ...

Why is this interesting? Because its type has nested foralls:

basicUnsafeNew :: forall v a. MVector v a => forall m. PrimMonad m => Int -> m (v (PrimState m) a)

This means that under GHC 8.8, we actually generate the following specialization:

$s$fMVectorMVectorPixel_$cbasicUnsafeNew
  :: forall m. PrimMonad m => Int -> m (MVector (PrimState m) (Pixel Y Double))

Note that m is still polymorphic. The changes in !668 (merged) are specifically designed to handle cases like these differently, since sometimes these added big lambdas would cause specialization to fail. My hunch is that those changes also unintentionally broke specializing functions like basicUnsafeNew, but I haven’t verified that hunch yet. More to follow, most likely!

Ha! That is indeed interesting!

So in some ways 8.10 can specialise more because it can look through an arrow (->) to find another (=>). But in some ways you have discovered that it specialises less because it insists on having a witness for each dictionary argument. That is an unexpected consequence of your patch.

Consider

f :: forall a b. (Eq a, Ord b) => blah
f = <f_body>

g :: forall b Ord b => blah
g = /\b \(d:Ord b). ...(f @Int @b dEqInt d)...

We'd like to specialise that call to f, even though one of its dictionary arguments isn't in scope at f's definition site:

$sf :: forall b. Ord b => blah[Int/a]
$sf = /\b \(d:Ord b). <f_body> @Int @b @dEqInt d

RULE forall b (dInt :: Eq Int) (dOrd :: Ord b).
       f @Int b dInt dOrd = $sf b dOrd

We even have the mechanism available: when d goes out of scope, change SpecDict d to UnspecArg.

So in some ways 8.10 can specialise more because it can look through an arrow (->) to find another (=>). But in some ways you have discovered that it specialises less because it insists on having a witness for each dictionary argument. That is an unexpected consequence of your patch.

Yes, precisely! (Well, except the bit about it being my patch—I can take credit for neither the improvement nor the regression. )

We even have the mechanism available: when d goes out of scope, change SpecDict d to UnspecArg.

Unfortunately, that doesn’t completely solve the problem on its own. Consider this small program, which reproduces the issue:

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

module T17966 where

class C a b where
  m :: Show c => a -> b -> c -> String

instance Show b => C Bool b where
  m a b c = show a ++ show b ++ show c
  {-# INLINABLE [0] m #-}

f :: (C a b, Show c) => a -> b -> c -> String
f a b c = m a b c ++ "!"
{-# INLINABLE [0] f #-}

x :: String
x = f True () (Just 42)

GHC 8.8 specializes both f and m (though it only specializes m at its first two arguments), but GHC 8.10 only specializes f. If we desugar the definition of x, it becomes apparent why:

x = f ($cm $fShow()) ($fShowMaybe $fShowInteger) True () (Just 42)

Here we pass $cm $fShow() as a dictionary, and that in isolation is a specializable call according to GHC 8.8. But on GHC 8.10, it isn’t that the second dictionary for $cm is locally-bound, we don’t even have it at all! So we still have to generate call info even if the call looks like it isn’t fully-saturated.

An alternative approach to consider is that, after specializing f, we end up with a call to $cm applied to two dictionaries, not just one, but we don’t immediately see it because we floated $cm $fShow() out. I don’t know if it’s worth arranging to discover that case, but its existence was surprising to me to discover.

Ah yes, I was conflating two conversations -- it was Sandy Maguire in this (excellent) patch

commit 2d0cf6252957b8980d89481ecd0b79891da4b14b
Author: Sandy Maguire <sandy@sandymaguire.me>
Date:   Thu May 16 12:12:10 2019 -0400

    Let the specialiser work on dicts under lambdas
    
    Following the discussion under #16473, this change allows the
    specializer to work on any dicts in a lambda, not just those that occur
    at the beginning.
    
    For example, if you use data types which contain dictionaries and
    higher-rank functions then once these are erased by the optimiser you
    end up with functions such as:
    
    ```
      go_s4K9
      Int#
      -> forall (m :: * -> *).
         Monad m =>
         (forall x. Union '[State (Sum Int)] x -> m x) -> m ()
    ```
    
    The dictionary argument is after the Int# value argument, this patch
    allows `go` to be specialised.

we don’t even have it at all!

It took me some while to grok this, but what you mean is this. In a more extreme version we might have this:

f :: forall a b. (Eq a, Ord b) => blah

g = ...map (f ty1 ty2 dEq)...

So f is applied to one dictionary, but not two. (In your test case the forall is around the other way.)

This example and the one I gave earlier are closely related; I could equally have written

g = ...map (\dOrd. f ty1 ty2 dEq dOrd)...

and now the call is saturated but with a dictionary that goes out of scope, in my previous comment.

Hmm. There is actually no reason for us to require that a call is saturated, in order to specialise it. Provided we have some SpecDict arguments, we can specialise regardless. Is that an opportunity you feel inclined to take up?

So f is applied to one dictionary, but not two. (In your test case the forall is around the other way.)

Yes, that’s right.

This example and the one I gave earlier are closely related; I could equally have written

That is true, but note that, at the Haskell level, doing this eta expansion is somewhat more involved. (Is it even possible to do, or will the simple optimizer take it back out again before the specializer sees it?)

Hmm. There is actually no reason for us to require that a call is saturated, in order to specialise it. Provided we have some SpecDict arguments, we can specialise regardless. Is that an opportunity you feel inclined to take up?

I don’t know! I think I understand what’s going on, but I have no idea what the tradeoffs are in terms of specialization of real programs. (In all honesty, all this partial specialization business makes me a little uncomfortable as it is; GHC’s optimizer already generates a lot of code, and this seems like it can only make that worse.)

So I don’t think I have a horse in this race, at least not yet. The only reason I bring it up at all is that it’s something GHC 8.8 did that 8.10 does not.

Yes, specialisation can generate a lot of code, but it's extraordinarily effective in making programs run faster! As indeed we see here.

Re eta, I was only suggesting that the two are related; I wasn't suggesting doing eta-expansion to make them the same.

Let's hope someone has the bandwidth to take this up. I hate the idea that 8.10 programs in the wild are running slower.

Yes, specialisation can generate a lot of code, but it's extraordinarily effective in making programs run faster!

Indeed! I wouldn’t want to imply otherwise. (I’m more worried about the potential pitfalls of overlapping specializations than of the concept of partial specialization in general.)

Let's hope someone has the bandwidth to take this up. I hate the idea that 8.10 programs in the wild are running slower.

I will consider looking into how hard it would be to fix after !2913 (closed) is finished. If it turns out to be a lot of work, I don’t want to commit to it (I’ve been working on !2913 (closed) on my own time, and it’s already eaten up a lot of that!). But if it’s easy, I probably might as well do it, seeing as I already have the relevant context in my head.

Actually I think it's easy. Maybe even dead easy. But I have been wrong before :-)

I'm on this.

Code generated by GHC 8.10.1 is slower than code generated by GHC 8.6.5 (or 8.8.3)

Summary

Steps to reproduce

Expected behavior

Environment

Child items ...

Activity