We've found that for cardano-node, speculative evaluation introduced in the CorePrep stage can cause a large increase in allocations. Some dictionaries get created far more often because of this transformation.
I have a draft MR, !13556 (merged), that allows speculative evaluation to be turned off (either entirely or just for DFuns). Perhaps we can change the DFun flag to one that sets an upper bound on the amount of allocation we do.
We've found that for cardano-node, speculative evaluation introduced in the CorePrep stage can cause a large increase in allocations. Some dictionaries get created far more often because of this transformation.
That is pretty strange. It implies that much of the time:
The dictionary is large, or expensive to construct
It is usually 100% unused, i.e. not evaluated at all.
It would be very helpful to have more insight into what is going on, in case there really is a bug of some kind. If the increase is "enormous" it can't be hard to find!
If the dictionary is 100% unused, I wonder if it'd be helpful to inline or specialise the "outer shell" of the function, to inline the path that discards the dictionary -- if it is indeed in the outer shell.
The dictionaries being created are very large indeed (at least the ones that drew our attention in the ticky profiles). I'll collect a bit more information and report back.
It would indeed be great to have this reproduced in a minimal example.
Perhaps it is something like this:
We have dictionaries $dA :: C A and $dB :: C B, each with 100 fields, for a class C.
We have instance (C a, C b) => C (a, b) where ..., giving rise to a DFun $dfPairC da db = MkC (op1 ... da .. db) (op2 ... da ... db) ....
We need to call a function which needs C (A,B), which is constructed using $dfPairC $dA $dB. Note that this expression has quite a small closure, because there are only 2 free variables.
We speculate the call to $dfPairC. Result: A value MkC (op1 ... $dA ... $dB) (op2 .. $dA .. $dB) ... etc. This leads to 100 closures with 2 free variables each! Quite wasteful.
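A source-level rendering of that scenario might look like the following sketch (the module name and the 3-method class are stand-ins for the 100-field C described above, just to make the shape concrete):

module PairDict where

-- Stand-in for the 100-field class C from the description above
-- (only 3 methods shown).
class C a where
  op1 :: a -> Int
  op2 :: a -> Int
  op3 :: a -> Int

-- The pair instance whose DFun ($dfPairC) mentions both argument
-- dictionaries in every field it builds.
instance (C a, C b) => C (a, b) where
  op1 (x, y) = op1 x + op1 y
  op2 (x, y) = op2 x + op2 y
  op3 (x, y) = op3 x + op3 y

-- A call that needs C (a, b): the dictionary $dfPairC $dA $dB is built
-- here, and speculating it allocates one closure per method, each
-- capturing both argument dictionaries.
needsC :: C c => c -> Int
needsC = op1

useIt :: (C a, C b) => a -> b -> Int
useIt x y = needsC (x, y)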
I think we could prevent such waste by doing a quick analysis of the DFun.
Say, if one of the parameters occurs in more than 10 distinct fields, don't speculate.
But adding a flag as in !13556 (merged) is a simple solution as well.
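For concreteness, here's a rough sketch of what such a quick DFun analysis might look like, assuming the DFun's right-hand side really is a lambda over its dictionary arguments wrapped around a constructor application. dfunSpeculationOK and the threshold of 10 are illustrative names and numbers, not actual GHC code:

module DFunSpecCheck where

import GHC.Core (CoreExpr, collectBinders, collectArgs)
import GHC.Core.FVs (exprFreeVars)
import GHC.Types.Var (Var, isId)
import GHC.Types.Var.Set (elemVarSet)

-- Count, for each dictionary parameter of the DFun, how many of the
-- constructed fields mention it; refuse to speculate if any parameter
-- shows up in more than 10 distinct fields.
dfunSpeculationOK :: CoreExpr -> Bool
dfunSpeculationOK rhs = all fewOccurrences dictParams
  where
    (params, body) = collectBinders rhs   -- \ @a .. d1 .. dn -> body
    dictParams     = filter isId params   -- keep only the value (dictionary) binders
    (_con, fields) = collectArgs body     -- MkC e1 .. em (plus type arguments)

    fewOccurrences :: Var -> Bool
    fewOccurrences d =
      length [ f | f <- fields, d `elemVarSet` exprFreeVars f ] <= 10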
I've experimented a bit with a size cutoff for the dictionary size vs the number of free vars in the closure, but the results weren't conclusive for our testcase (perhaps I made some mistake). Unfortunately our test code is rather complicated and a test run takes 45 minutes.
I think it would be useful to have the flags "just in case", to be able to measure/mitigate the effect of the optimisation. My main concern is that the -fcoreprep-spec-eval-dictfun flag might be superseded by a numerical flag that controls some growth factor.
I marked the !13556 (merged) MR as draft since I'd like to also implement a numerical cutoff for speculating dfuns and get some feedback on that approach.
Here's a testcase that's hopefully a somewhat faithful representation of what happens in our codebase.
Normally GHC tries very hard to partially apply polymorphic functions to their typeclass arguments so it doesn't have to recompute dictionaries. But if that fails (because of a higher-rank type in the test), speculative evaluation can be costly if the dictionary goes unused.
output:
...45681238
A (spec) allocated: 32007104
45681238
B (no spec) allocated: 19207408
A allocates a lot more because speculative evaluation makes it build the HasConst10 dictionary on every call.
Cls.hs
{-# LANGUAGE UndecidableInstances #-}
module Cls where

class HasConst a where
  constVal :: a

instance Cls.HasConst Word where constVal = 123
instance Cls.HasConst Int where constVal = 456

-- this class has a big dictionary
class HasConst10 a where
  constA :: a
  constInt1 :: a -> Int
  constInt1 _ = 1
  constInt2 :: a -> Int
  constInt2 _ = 2
  constInt3 :: a -> Int
  constInt3 _ = 3
  constInt4 :: a -> Int
  constInt4 _ = 4
  constInt5 :: a -> Int
  constInt5 _ = 5
  constInt6 :: a -> Int
  constInt6 _ = 6
  constInt7 :: a -> Int
  constInt7 _ = 7
  constInt8 :: a -> Int
  constInt8 _ = 8
  constInt9 :: a -> Int
  constInt9 _ = 9

instance HasConst a => HasConst10 a where
  constA = constVal

-- this doesn't use the big dictionary most of the time
printConst :: forall a. (Show a, HasConst10 a) => a -> Int -> IO ()
printConst x 5000 = print @a constA >> print (constInt8 x)
printConst _ _    = pure ()
A.hs
{-# OPTIONS_GHC -fspec-eval #-}
module A (testX) where

import qualified Cls

-- this creates the big dictionary strictly because of speculative evaluation
testX :: (Show a, Cls.HasConst a) => a -> Int -> IO ()
testX a b = Cls.printConst a b
B.hs
{-# OPTIONS_GHC -fno-spec-eval #-}
module B (testX) where

import qualified Cls

-- this creates the big dictionary lazily
testX :: (Show a, Cls.HasConst a) => a -> Int -> IO ()
testX a b = Cls.printConst a b
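The driver module isn't shown above; a hypothetical Main.hs along these lines (the iteration count and the use of getAllocationCounter are my guesses, reconstructed from the shape of the output) would exercise both paths and print per-module allocation totals:

Main.hs (hypothetical)

module Main where

import Control.Monad (forM_)
import GHC.Conc (getAllocationCounter, setAllocationCounter)

import qualified A
import qualified B

-- Report how much an action allocates; the allocation counter counts
-- down, so the total allocated is the negated final value.
measure :: String -> IO () -> IO ()
measure label act = do
  setAllocationCounter 0
  act
  n <- getAllocationCounter
  putStrLn (label ++ " allocated: " ++ show (negate n))

main :: IO ()
main = do
  measure "A (spec)" $
    forM_ [1 .. 100000] $ \i -> A.testX (1 :: Int) i >> A.testX (1 :: Word) i
  measure "B (no spec)" $
    forM_ [1 .. 100000] $ \i -> B.testX (1 :: Int) i >> B.testX (1 :: Word) i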
To summarise: we have a binding d = $df d1 d2 d3, where d, d1, d2, d3 are dictionaries and $df is a dictionary function arising from an instance decl.
You have -fdicts-strict switched off, otherwise you'd compute d regardless.
But speculative evaluation evaluates d strictly anyway, on the grounds that doing so is cheap.
In your case, however, speculative evaluation is not cheap, because
d has many, many components
d turns out not to be used; it is seldom evaluated
How much can $df d1 d2 d3 allocate? It'll allocate
A tuple for the dictionary itself
A closure for each component of the dictionary, capturing none, some, or all of d1, d2, d3.
And similarly for each of the dictionary's superclass dictionaries.
In your particular case the $df is
$df :: forall a. HasConst a => HasConst10 a
This function builds a 10-tuple. All its fields are constants except one, which is obtained from the HasConst dictionary.
In allocation terms, a dictionary will almost always be bigger than a thunk, because each component of the dictionary will likely capture some or all of the free variables of the thunk. So maybe we should switch off speculative evaluation of dictionaries altogether? What is the perf hit of that?
I suppose we could get some measure of the size of a dictionary by looking at how many fields it has. Something like
dictSize :: PredType -> Int
dictSize pred = length (classOpItems cls) + sum (map dictSize (classSCTheta cls))
  where
    (cls, _) = getClassPredTys pred
This isn't quite good enough because of recursive superclasses; you need to bale out if you see the same class again. But it'd be a start.
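For what it's worth, one way that bale-out could look, as a sketch against GHC's current module names (untested; the accumulating list of seen classes is the only addition over the version above):

module DictSize where

import GHC.Core.Class (Class, classOpItems, classSCTheta)
import GHC.Core.Predicate (getClassPredTys_maybe)
import GHC.Core.Type (PredType)

-- As above, but stop when a class is encountered a second time, so
-- recursive superclasses don't send us into a loop.
dictSize :: PredType -> Int
dictSize = go []
  where
    go :: [Class] -> PredType -> Int
    go seen predTy
      | Just (cls, _args) <- getClassPredTys_maybe predTy
      , cls `notElem` seen
      = length (classOpItems cls)
          + sum (map (go (cls : seen)) (classSCTheta cls))
      | otherwise
      = 0   -- not a class predicate, or a class already counted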