## Superclass `Monad m =>` makes program run 100 times slower

Hi! I've just encountered a very bizarre error.

### General description

The code:

```
class LayersFoldableBuilder__ t (layers :: [Type]) m where
buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
instance Monad m => LayersFoldableBuilder__ t '[] m where
buildLayersFold__ = \_ a -> a
{-# INLINE buildLayersFold__ #-}
instance ( MonadIO m
, Storable.Storable (Layer.Cons l ())
, Layer.StorableLayer l m
, LayerFoldableBuilder__ (EnabledLayer t l) t m l
, LayersFoldableBuilder__ t ls m )
=> LayersFoldableBuilder__ t (l ': ls) m where
buildLayersFold__ = \ptr mr -> do
let fs = buildLayersFold__ @t @ls ptr'
ptr' = Ptr.plusPtr ptr $ Layer.byteSize @l
layerBuild__ @(EnabledLayer t l) @t @m @l ptr $! fs mr
{-# INLINE buildLayersFold__ #-}
```

This is a typeclass `LayersFoldableBuilder__`

and ALL of its instances. Please note, that every instance has a `Monad m`

or `MonadIO m`

constraint. The program which uses this code heavily runs in 40ms. If we only add constraint `Monad m =>`

to the class definition:

```
class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
```

The program runs in 3.5s , which is almost 100 times slower.

Unfortunatelly I do not have minimal example, but it is reproducible. It is a part of the Luna Language codebase: https://github.com/luna/luna-core/blob/60bf6130691c23e52b97b067b52becb8fdb0c72e/core/src/Data/Graph/Traversal/Scoped.hs\#L102

it was introduced in the commit 60bf6130691c23e52b97b067b52becb8fdb0c72e on branch static-layers . However, building this is simple: stack bench luna-core . After invoking it we see the described results.

### Why its important and what should we do to fix it ===

- I am writing this because I care about Haskell community. I want GHC and Haskell to be widely used. Right now, the only thing I hear from all companies around our company is that impredicative performance, even when following rules "how to write efficient code" is the biggest pain people have. Haskell is gathering attention - pure functional programming, immutability etc - are great. But it will not become a popular choice unless we care about predictive performance.
- Such performance changes are unacceptable when thinking about Haskell and GHC as production ready systems. We need a clear way how to write high performance Haskell without the need to benchmark every part of our programs even when refactoring things. GHC has enough information to discover that we want high performance here and there (even by looking at INLINE pragmas) and should warn us about lack of optimization. We should also have a way to force GHC to apply optimizations in particular places - for example by explicit marking code to be always specialized during compilation, so GHC would never fall back to dict-passing in such places. Such possibility would solve MANY related problems and user fears.
- The point 2 also applies to strictness. In my opinion, having more clear strictness resolution rules / tools is important. Right now the only way to know if strictness analysis did a good job and we are not constantly boxing / unboxing things is to read core, which is tedious and 99% of Haskell users do not even know how to do it (We've got 10 really, really good Haskellers here, 2 of them are capable of reading core, but not very fluently). I would love to chat more about these topics, because they are crucial for growing Haskell community and making Haskell more popular choice, which is waht we want, right? We don't want Haskell to be just a research project with "all its users being its authors" at the same time, am I

### What happens in core ===

I inspected core and have found that indeed, after adding the constraint, GHC does not apply all optimizations to the definitions. To be honest, I completely don't understand it, because the code uses everywhere explicit `INLINE`

pragma to be sure everything is optimized away in the compilation stage:

```
--------------------------------------------------------------------------------
SLOW, with Monad m =>
--------------------------------------------------------------------------------
-- RHS size: {terms: 5, types: 12, coercions: 4, joins: 0/0}
buildLayersFold__ [InlPrag=INLINE]
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S,U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=False,boring_ok=True)
Tmpl= \ (@ t_ao0O)
(@ (layers_ao0P :: [*]))
(@ (m_ao0Q :: * -> *))
(v_B1 [Occ=Once]
:: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
v_B1
`cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
<t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
:: (LayersFoldableBuilder__
t_ao0O layers_ao0P m_ao0Q :: Constraint)
~R# (SomePtr
-> m_ao0Q (Fold.Result t_ao0O)
-> m_ao0Q (Fold.Result t_ao0O) :: *))}]
buildLayersFold__
= \ (@ t_ao0O)
(@ (layers_ao0P :: [*]))
(@ (m_ao0Q :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
v_B1
`cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
<t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
:: (LayersFoldableBuilder__
t_ao0O layers_ao0P m_ao0Q :: Constraint)
~R# (SomePtr
-> m_ao0Q (Fold.Result t_ao0O)
-> m_ao0Q (Fold.Result t_ao0O) :: *))
--------------------------------------------------------------------------------
FAST, without Monad m =>
--------------------------------------------------------------------------------
-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
Monad m
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S(SL),U(U,A)>,
RULES: Built in rule for Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__: "Class op $p1LayersFoldableBuilder__"]
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
= \ (@ t_ao0P)
(@ (layers_ao0Q :: [*]))
(@ (m_ao0R :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
case v_B1 of v_B1
{ Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
v_B3 ->
v_B2
}
-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
buildLayersFold__
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S(LS),U(A,U)>,
RULES: Built in rule for buildLayersFold__: "Class op buildLayersFold__"]
buildLayersFold__
= \ (@ t_ao0P)
(@ (layers_ao0Q :: [*]))
(@ (m_ao0R :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
case v_B1 of v_B1
{ Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
v_B3 ->
v_B3
}
```