Compilation time/space regression in GHC 8.0/8.1 (search in type-level lists and -O)
After upgrading to GHC 8.0.1 I've noticed that the compilation time of one of my libraries has become unacceptably long. Please find the attached test case. Here are some numbers from my machine:
$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 19103924976 bytes, 1441 GCs, 73048470/256579104 avg/max bytes residency (18 samples), 542M in use, 0.000 INIT (0.002 elapsed), 14.163 MUT (14.257 elapsed), 5.707 GC (5.830 elapsed) :ghc>>
real 0m20.160s
user 0m19.937s
sys 0m0.197s
$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 147662211104 bytes, 9635 GCs, 190336418/433097448 avg/max bytes residency (38 samples), 1210M in use, 0.000 INIT (0.001 elapsed), 101.707 MUT (101.840 elapsed), 27.907 GC (28.259 elapsed) :ghc>>
real 2m10.195s
user 2m9.773s
sys 0m0.400s
Both versions are the official Debian 8 x86_64 binaries from haskell.org. I also compiled git master, and unfortunately it is affected as well.
It is worth mentioning that turning off optimization makes the problem disappear:
$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O0 -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 3779057184 bytes, 225 GCs, 17923355/81566792 avg/max bytes residency (8 samples), 164M in use, 0.000 INIT (0.001 elapsed), 2.847 MUT (2.941 elapsed), 0.553 GC (0.572 elapsed) :ghc>>
real 0m3.593s
user 0m3.507s
sys 0m0.073s
Trac metadata

Trac field | Value
---|---
Version | 8.0.1
Type | Bug
TypeOfFailure | OtherFailure
Priority | normal
Resolution | Unresolved
Component | Compiler
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture | |
Activity
- mikhail.vorozhtsov changed weight to 5
- mikhail.vorozhtsov added Tbug Trac import labels
- Attached file: TypeList.hs
- Attached file: Regression.hs
- Developer
Thanks for the great test case!
The results of putting NOINLINE on elemWitness:

$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 4483083448 bytes, 546 GCs, 31794039/124563608 avg/max bytes residency (13 samples), 294M in use, 0.000 INIT (0.001 elapsed), 3.247 MUT (3.279 elapsed), 1.677 GC (1.753 elapsed) :ghc>>
real 0m5.092s
user 0m4.983s
sys 0m0.097s

$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 25719759160 bytes, 2550 GCs, 51924484/183848928 avg/max bytes residency (16 samples), 510M in use, 0.000 INIT (0.000 elapsed), 19.883 MUT (20.050 elapsed), 3.770 GC (3.880 elapsed) :ghc>>
real 0m23.988s
user 0m23.747s
sys 0m0.190s
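For concreteness, a minimal sketch of how such an experiment is typically done; the class and instance below are hypothetical stand-ins, not the actual TypeList.hs definitions:

-- Hypothetical stand-in for the experiment above: a NOINLINE pragma on the
-- method definition in each instance prevents GHC from inlining that
-- definition at call sites, so its (potentially coercion-heavy) right-hand
-- side is not duplicated and re-simplified at every use.
class Witness a where
  witness :: proxy a -> String

instance Witness Int where
  witness _ = "Int"
  {-# NOINLINE witness #-}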
- Developer
The interface file produced by both versions is absolutely huge (17 MB). A very large number of coercions are generated by the instance declarations in Regression.hs.
Interestingly, one way to improve the performance on 7.10.3 is marking elemWitness as INLINABLE. This stops the unfoldings being optimised, but it doesn't bring performance to the same level as omitting unfoldings completely.
- Developer
Several points here:
- We don't yet know why 8.0 is worse than 7.10; finding out would be helpful.
- The coercions record a proof of all the list manipulations involved, so it's not surprising that they can get big. What is surprising is why they are bigger with 8.0 (a small illustration follows this list).
- GHC includes a "coercion optimiser", OptCoercion.hs, which tries to shrink a big coercion to a smaller but equivalent coercion. It's possible that 8.0 is somehow missing a crucial optimisation that fires in 7.10.
- We also have a plan, recorded in Trac #8095, to discard coercions altogether. But neither Richard nor I have found time to execute on that plan. See also #11598 (closed).
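As a rough illustration of the second point (a hypothetical toy example, not the attached test case), a type-level list search like the one in TypeList.hs forces the typechecker to justify every reduction step with equality evidence, which Core records as coercions that grow with the length of the list:

{-# LANGUAGE DataKinds, GADTs, PolyKinds, TypeFamilies, TypeOperators #-}
module CoercionSketch where

-- A closed type family performing a linear search in a type-level list.
type family Find (x :: k) (xs :: [k]) :: Bool where
  Find x '[]       = 'False
  Find x (x ': xs) = 'True
  Find x (y ': xs) = Find x xs

-- Evidence that the search succeeded; building a value of this type makes GHC
-- discharge the (Find a l ~ 'True) constraint with a chain of type-family
-- axiom applications, roughly one coercion step per list element skipped.
data Member (a :: k) (l :: [k]) where
  Member :: Find a l ~ 'True => Member a l

member :: Member Int '[Bool, Char, Int]
member = Member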
Bottom line: more insight into what is going on would be really helpful.
Here are some numbers from -v4 logs:

$ grep -E '(=====|^Result|coercions)' out-7_10.log | grep -v 'RHS size' | head -40
==================== Parser ====================
==================== Source Statistics ====================
==================== Occurrence analysis ====================
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization) = {terms: 2,144, types: 128,368, coercions: 415,496}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1 = {terms: 3,200, types: 173,424, coercions: 415,528}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 3,200, types: 173,424, coercions: 415,528}
==================== Specialise ====================
Result size of Specialise = {terms: 12,416, types: 658,320, coercions: 455,704}
==================== FloatOut stats: ====================
Result size of Float out(FOS {Lam = Just 0, = {terms: 13,472, types: 739,808, coercions: 455,704}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1 = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
Result size of Float inwards = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Called arity analysis ====================
Result size of Called arity analysis = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
for 7.10.3 versus
$ grep -E '(=====|^Result|coercions)' out-8_0.log | grep -v 'RHS size' | head -40
==================== Parser ====================
==================== Source Statistics ====================
==================== Occurrence analysis ====================
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization) = {terms: 3,835, types: 147,262, coercions: 549,848}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1 = {terms: 4,897, types: 161,681, coercions: 549,849}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 4,897, types: 161,681, coercions: 549,849}
==================== Specialise ====================
Result size of Specialise = {terms: 17,793, types: 899,977, coercions: 3,094,577}
==================== FloatOut stats: ====================
Result size of Float out(FOS {Lam = Just 0, = {terms: 20,769, types: 1,109,785, coercions: 3,094,577}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1 = {terms: 16,553, types: 871,953, coercions: 2,232,281}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 16,429, types: 864,761, coercions: 2,184,448}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 16,429, types: 864,761, coercions: 2,184,448}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1 = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Occurrence analysis ====================
Result size of Simplifier = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
Result size of Float inwards = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Called arity analysis ====================
Result size of Called arity analysis = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Occurrence analysis ====================
for 8.0.1
- Developer
This, to me, does not look like a cut-and-dried type-level issue. Note that the number of terms is much greater in 8.0. And the ratio of coercions to terms is lower in 8.0 than 7.10.
It looks like the occurrence analysis in 8.0 is to blame for the blowup in terms... whereas that same pass in 7.10 caused more of a blowup in coercions. Weird.
- Niklas Hambüchen mentioned in issue #13429 (closed)
- Maintainer
A sizable chunk of the regression came from commit 67465497 (Add kind equalities to GHC.)
Commit 6e56ac58a6905197412d58e32792a04a63b94d7e (Fix infix record field fixity (#11167 and #11173).)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 19608856112 bytes, 2265 GCs, 63045253/234165360 avg/max bytes residency (19 samples), 498M in use, 0.001 INIT (0.001 elapsed), 9.087 MUT (9.117 elapsed), 3.649 GC (3.647 elapsed) :ghc>>
real 0m12.829s
user 0m12.688s
sys 0m0.140s

Commit 6746549772c5cc0ac66c0fce562f297f4d4b80a2 (Add kind equalities to GHC.)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 68214058248 bytes, 5147 GCs, 71712416/267022672 avg/max bytes residency (20 samples), 586M in use, 0.001 INIT (0.001 elapsed), 29.850 MUT (29.871 elapsed), 4.806 GC (4.804 elapsed) :ghc>>
real 0m34.744s
user 0m34.572s
sys 0m0.176s
But that's not the full story, since I get a real time of 1m12.849s with GHC 8.0.1. I'll try to figure out what other commit(s) are contributing.
- Maintainer
Found it. The other culprit is commit 1722fa10 (Fix #11230 (closed).):
Commit ae86eb9f72fa7220fe47ac54d6d21395691c1308 (Fix tcTyClTyVars to handle SigTvs)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 68211633024 bytes, 4916 GCs, 73233972/211211872 avg/max bytes residency (20 samples), 571M in use, 0.001 INIT (0.001 elapsed), 28.299 MUT (28.320 elapsed), 4.831 GC (4.828 elapsed) :ghc>>
real 0m33.216s
user 0m33.040s
sys 0m0.176s

Commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (Fix #11230.)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 161224532608 bytes, 8789 GCs, 195013385/460659576 avg/max bytes residency (37 samples), 1272M in use, 0.001 INIT (0.001 elapsed), 66.710 MUT (66.753 elapsed), 18.669 GC (18.660 elapsed) :ghc>>
real 1m25.494s
user 1m25.252s
sys 0m0.260s
What's alarming is that that commit is so small! There must be something interesting happening there...
- Reporter
- Ryan Scott mentioned in commit a786b136
- Developer
See this ghc-devs thread, and this one.
There seem to be several issues:
- Why does the small change identified by Ryan have such a huge effect? We need insight.
- The change that Richard made to CoreOpt.simple_opt_expr actually fixed an outright bug. But it's still wrong now. See Trac #13798.
- The fuss that is caused by this map/coerce rule is out of all proportion to its importance. We should think of a better way (see the sketch after this list).
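For readers unfamiliar with it, the rule in question comes from base's GHC.Base and roughly says that mapping a zero-cost coercion over a list can be replaced by a single zero-cost coercion of the whole list (roughly: map coerce = coerce). A minimal usage sketch, my example rather than anything from the thread:

module MapCoerceSketch where

import Data.Coerce (coerce)

newtype Age = Age Int

-- Without the rule this would traverse the list applying an identity-like
-- function; the map/coerce rewrite lets GHC turn the whole traversal into a
-- single zero-cost coercion on the list.
ages :: [Int] -> [Age]
ages = map coerce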
- Developer
Incidentally, HEAD is much much better:
7.10: 19.2G allocated by GHC, 50 seconds
HEAD:  7.1G allocated by GHC, 30 seconds
I gave up waiting for 8.0. So apparently the original problem has gone away, although I'd quite like to know how!
- Maintainer
Well hot dang. It turns out that commit 2b74bd9d (Stop the specialiser generating loopy code) is once again responsible for fixing the problem!
Commit 92a4f908f2599150bec0530d688997f03780646e (Spelling typos)
<<ghc: 161228659408 bytes, 6342 GCs, 101585298/269563168 avg/max bytes residency (15 samples), 712M in use, 0.001 INIT (0.001 elapsed), 53.742 MUT (53.674 elapsed), 5.431 GC (5.420 elapsed) :ghc>>
real 0m59.150s
user 0m59.020s
sys 0m0.224s

Commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code)
<<ghc: 7146221432 bytes, 755 GCs, 30379988/108444792 avg/max bytes residency (10 samples), 240M in use, 0.001 INIT (0.001 elapsed), 3.583 MUT (3.603 elapsed), 0.890 GC (0.888 elapsed) :ghc>>
real 0m4.549s
user 0m4.440s
sys 0m0.096s
For reference, the time for GHC 7.10.3 was:
<<ghc: 19108000616 bytes, 1539 GCs, 62086227/176263896 avg/max bytes residency (20 samples), 453M in use, 0.001 INIT (0.001 elapsed), 9.188 MUT (9.210 elapsed), 3.974 GC (3.972 elapsed) :ghc>>
real 0m13.477s
user 0m13.128s
sys 0m0.104s
So now it's even *better* than it used to be! That's pretty cool.
I don't have any particular insight as to why that commit made things so much better, but sure enough, TypeList.hs contains some classes with significant superclasses:

class (FindElem 'HeadElem a l ~ 'Elem p) => ElemAt p (a :: k) (l :: [k]) where
  elemWitness :: Proxy a -> Proxy l -> ElemWitness p a l

class IsElem t (ElemsOf a) => ElemOf a t where
So it's entirely possible that their corresponding dictionaries are just getting optimized better. It would be great to have someone verify this at the Core level.
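One low-effort way to do that verification (a suggestion using GHC's standard dump flags, not something prescribed in the thread) is to have GHC write out the simplified Core with coercions suppressed, so the dictionary bindings stay readable:

-- Lines to add temporarily at the top of Regression.hs (the same flags can
-- also be passed on the command line):
--   -ddump-simpl          dump Core after the simplifier has finished
--   -dsuppress-coercions  elide coercion proofs so the output stays readable
--   -ddump-to-file        write the dump to a .dump-simpl file instead of stdout
{-# OPTIONS_GHC -ddump-simpl -dsuppress-coercions -ddump-to-file #-}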
- Reporter
I've backported the above-mentioned commit to GHC 8.0.2 and created a bindist for Linux:
https://github.com/nh2/ghc/releases/tag/ghc-8.0.2-bugfix-12545-backport-2017-06-08
- Maintainer
Trac metadata
Trac field | Value
---|---
Differential revisions | - → D3632