## Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)

```
selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm)
                         => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()  -- slow

selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
                         => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()  -- fast
```

These two functions are identical except that one has `VG.Mutable v ~ vm` as a constraint, while the other has `VG.Mutable v` applied directly in the type to the right of the `=>`.

The second function is 10x faster. I would expect them to be equally fast: the code of the functions is identical, I change only the type signature.
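To make the shape of the two signatures concrete, here is a minimal self-contained sketch (with hypothetical names `Elem`, `headVia`, `headDirect`; not taken from the reproducer) contrasting an equality-constraint signature with one that applies the type family directly:

```
{-# LANGUAGE TypeFamilies #-}
module Main where

-- A toy type family standing in for VG.Mutable.
type family Elem c
type instance Elem [a] = a

-- Equality-constraint style: the element type is a fresh variable 'e',
-- related to 'c' only through the (~) constraints.
headVia :: (c ~ [e], Elem c ~ e) => c -> e
headVia = head

-- Direct-application style: the type family is applied
-- right in the signature.
headDirect :: [Elem [a]] -> Elem [a]
headDirect = head

main :: IO ()
main = print (headVia [1, 2, 3 :: Int], headDirect "abc")
```

Both versions typecheck and behave identically at runtime; the question in this ticket is why the two styles lead GHC's optimiser to such different Core.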

The slowness happens because with the first function, inlining of primitives like `unsafeRead` does not happen, and thus it also boxes the `Int#`s back into `Int`s when calling `unsafeRead`.

In particular, in the `-ddump-simpl` output, the slow version has

```
$wpartitionLoop2_rgEy
  :: forall (m :: * -> *) (vm :: * -> * -> *) e.
     (PrimMonad m, MVector vm e, Ord e) =>
     vm (PrimState m) e
     -> GHC.Prim.Int#
     -> e
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> m Int
$wpartitionLoop2_rgEy
  = \... ->
      let {
        endIndex_a7Ay :: Int
        ...
        endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
      ...
      (VGM.basicUnsafeRead
        ...
        endIndex_a7Ay)
      ...
```

while the fast version has

```
$w$spartitionLoop2_rgUN
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
...
$spartitionLoop2_rgUP
  :: VG.Mutable VU.Vector (PrimState IO) Int
     -> Int
     -> Int
     -> Int
     -> Int
     -> Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
```

So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline and specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently to inline `basicUnsafeRead` on unboxed `Int#`.

But with `VG.Mutable v ~ vm`, GHC keeps `vm (PrimState m) e` abstract all the way, passes dictionaries around, and thus eventually cannot inline `basicUnsafeRead`. It re-boxes already unboxed values (as in `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`) before passing them into the non-inlined call of `basicUnsafeRead`, making a tight loop allocate that normally wouldn't allocate.
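The dictionary-passing effect described above can be illustrated with a small self-contained sketch (hypothetical class `Reader` and helpers; not from the reproducer). A class-method call at an abstract type must go through the dictionary that GHC passes at runtime, so the method body cannot be inlined at the call site; at a concrete type, GHC resolves the call to the instance's method and can inline it:

```
module Main where

-- A toy class standing in for MVector's basicUnsafeRead.
class Reader r where
  readOne :: r -> Int -> Int

newtype Table = Table [Int]

instance Reader Table where
  readOne (Table xs) i = xs !! i

-- Abstract type: 'readOne' is a lookup in the Reader dictionary,
-- an opaque call from the optimiser's point of view.
sumAbstract :: Reader r => r -> Int
sumAbstract r = readOne r 0 + readOne r 1

-- Concrete type: the Table instance is statically known,
-- so 'readOne' can be inlined and optimised.
sumConcrete :: Table -> Int
sumConcrete t = readOne t 0 + readOne t 1

main :: IO ()
main = print (sumAbstract (Table [10, 20]), sumConcrete (Table [1, 2]))
```

The two functions compute the same thing; the difference only shows up in the generated Core (dictionary call vs. inlined method), which is exactly the kind of difference `-ddump-simpl` reveals in the reproducer.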

Why might rewriting the type signature in such a trivial way make this happen?

I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee (in the latter case I commented out the QuickCheck- and criterion-related code because their dependencies don't build there yet).

Reproducer:

- https://github.com/nh2/haskell-quickselect-median-of-medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
- Running instructions are in the commit message; in short:
  `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full-laziness && ./Main +RTS -sstderr`

- For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of-medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis

I originally just wanted to write a fast median-of-medians implementation at ZuriHac 2017, but got totally derailed by this performance problem. The version I link here is a total mess because of my mauling it to track down this performance issue.

Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-factor` did not change the situation.