I am the author of the cl3 library on Hackage. I have noticed a huge increase in compile time and memory use when testing GHC 8.2.2 and 8.4.2: ghc-8.0.2 compiled the library in 4:17.33 using 3.5 GB, while ghc-8.2.2 took 26:40.15 and used 32.8 GB. That is roughly a 6x increase in time and a 9x increase in memory. This is not all bad: my nbody benchmark has improved by about 35% between ghc-8.0.2 and ghc-8.4.2, so the increased compilation time and memory usage are producing much better runtime performance.

I would be interested in workarounds that would help others compile the library on systems with fewer resources. I have 64 GB of memory in my system and would like to test out some -fno-* GHC options; could you point me in the right direction? The library is almost entirely pure functions. I am also interested in other approaches, such as rewriting things to make them easier on the compiler, or using NOINLINE on a trouble spot (and how to find that trouble spot in the first place).
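For reference, here is a minimal sketch (not taken from the cl3 sources; the module and function names are made up) of the two mechanisms asked about above: scoping a -fno-* option to a single module with an OPTIONS_GHC pragma, and marking one suspected definition with NOINLINE.

    {-# OPTIONS_GHC -fno-spec-constr #-}   -- example: disable one optimisation for this module only
    module Example where

    -- Stand-in for a definition suspected of blowing up compile time.
    suspect :: Double -> Double
    suspect x = x * x + 1
    {-# NOINLINE suspect #-}               -- keep GHC from inlining it at call sites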
Trac metadata

    Version: 8.4.2
    Type: Bug
    TypeOfFailure: OtherFailure
    Priority: normal
    Resolution: Unresolved
    Component: Compiler
    Test case:
    Differential revisions:
    BlockedBy:
    Related:
    Blocking:
    CC:
    Operating system:
    Architecture:
A quick first thing to do is to run GHC with -v. It prints statistics about each Core-to-Core pass (the size of the AST and, in recent versions, memory consumption), and you may be able to spot a pass where the size of the AST drastically increases.
It seems that the Eq instance is at the core of the problem: replacing it with a much simpler one allows GHC to compile the module in a timely fashion, and compilation then succeeds even with a heap limit of only 128M.
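(The replacement instance itself is not reproduced here.) Purely as an illustration of the general shape such a fix can take, on a made-up type rather than Cl3: comparing values through one canonical representation instead of a constructor-by-constructor case analysis keeps the generated Core much smaller.

    -- Illustrative only; Toy, toFull and this instance are not from cl3.
    data Toy = Small !Double | Big !Double !Double

    -- Expand every constructor to the same "full" shape ...
    toFull :: Toy -> (Double, Double)
    toFull (Small x) = (x, 0)
    toFull (Big x y) = (x, y)

    -- ... and compare the full shapes, instead of an n-by-n case analysis.
    instance Eq Toy where
      a == b = toFull a == toFull b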
Further investigation reveals that it's not necessarily the Eq instance itself that causes trouble; commenting out various parts of the module allows me to make it blow up or not blow up even with the full original Eq in place. It seems that multiple factors contribute to the blowup, and removing enough of them "fixes" the problem.
You can make the process faster by removing methods from instance Floating Cl3; it seems that every one of the 12 methods slows down compilation by a few seconds. For example, if you keep only exp and log it takes 22 seconds (and the root cause should still be present).
This is very useful. The reduced Cl3.hs, with all Floating methods except exp and log removed, gives a much cleaner result. Looking at the -ddump-rule-firings output, here are counts of how often each rule fired on 8.0.2 and 8.4.3:
    rule                   |  8.0   8.4
    -----------------------+------------
    *##                    |  127    44
    +##                    |   26   372
    ^2/Integer             |   55    55
    Class op -             |  419   417
    Class op /             |   15     8
    Class op *             | 1698  1693
    Class op **            |    3     2
    Class op +             |  737   734
    Class op abs           |    5     5
    Class op atan2         |    1     1
    Class op cos           |    7     5
    Class op cosh          |    3     2
    Class op exp           |   14    10
    Class op fromInteger   |   13     8
    Class op fromRational  |    4     2
    Class op log           |   20    14
    Class op log1p         |    6     4
    Class op negate        |  109   106
    Class op $p1Floating   |   25    16
    Class op $p1Fractional |   20    12
    Class op pi            |    1     1
    Class op recip         |    4     3
    Class op sin           |    7     5
    Class op sinh          |    3     2
    Class op sqrt          |   14    14
    doubleFromInteger      |    9     7
    SC:$clog0              |    6     0
    SC:$w$catan20          |    1     0
So it looks like a change to the way SpecConstr is handled is preventing either specializations or the RULES that follow from them.
Compiling with -fno-spec-constr makes no difference performance wise, but gets rid of the SC:... rule firings, so those weren't the culprit after all. Investigating further.
I have found an undocumented flag: -fno-worker-wrapper. With it, the original code compiles in 43.20 seconds (a 37x improvement in time) and uses 660 MB at most (a 48x improvement in space) with GHC 8.4.2. All of the library's tests pass and the benchmark still performs at the improved speed. It seems like the worker/wrapper transformation is causing issues with this code. Why is that? Perhaps worker/wrapper shouldn't be run in certain circumstances. What do you think?
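For anyone who wants to try the same workaround without changing their build settings globally, the flag should also be scopeable to just the affected module with an OPTIONS_GHC pragma (the module name below is only illustrative):

    {-# OPTIONS_GHC -fno-worker-wrapper #-}  -- disable the worker/wrapper split for this module only
    module Cl3 where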
Sounds like a promising lead, if not the key to figuring this out. Worker/wrapper is generally a good thing to do: in theory, all it should do is split a function into a small wrapper and a worker so that strictly used arguments can be passed unboxed, reducing argument-passing overhead. But I can imagine that this could also get in the way of inlinings and RULES.

So what we're looking for is code that fits the pattern that makes it a candidate for worker/wrapper (arguments that are used strictly and can be unboxed), where the worker/wrapper split then breaks up expressions that might otherwise trigger useful inlinings or rewrites.
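To make that concrete, here is a tiny example of the transformation in question (not taken from cl3; the Core in the comments is schematic, not literal GHC output):

    -- A function that is strict in its Int argument ...
    sumTo :: Int -> Int
    sumTo 0 = 0
    sumTo n = n + sumTo (n - 1)

    -- ... gets split roughly like this:
    --
    --   sumTo n = case n of I# n# -> $wsumTo n#   -- small wrapper, inlined at call sites
    --   $wsumTo 0# = 0#
    --   $wsumTo n# = n# +# $wsumTo (n# -# 1#)     -- worker loops on the unboxed Int#
    --
    -- Once calls go through the worker, RULES and inlinings that matched the
    -- original function may no longer apply, which is the concern here.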
I've continued to investigate and found an alternative workaround: if I change every $! in the code to $, I get the improved compile performance even with worker/wrapper active. I'm not sure what that means.
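That would fit with the worker/wrapper observation: ($!) forces its argument to WHNF before the call, which gives the demand analyser extra strictness to exploit and therefore more worker/wrapper opportunities, whereas plain ($) does not. A small reminder of the difference (not code from cl3):

    -- apStrict is strict in x; apLazy is not.
    apStrict :: (Double -> Double) -> Double -> Double
    apStrict f x = f $! x   -- x is evaluated to WHNF before the call

    apLazy :: (Double -> Double) -> Double -> Double
    apLazy f x = f $ x      -- x is passed unevaluated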
Since this program is tiny (my slightly simplified version has 1339 lines of code) and reports pretty bad numbers, I wanted to give this a try, and I think I may have accidentally found a bug in GHC HEAD (a regression from 8.8.2). However, I couldn't reproduce the original issue reported here (numbers below).
(There are two modules in this program; the first number in each cell is for the first module, the second for the second.)
Build times

                -O0                  -O1                  -O2
    7.10.3      0:02.26 + 0:00.14    0:06.87 + 1:04.06    0:07.91 + 1:07.27
    8.0.2       0:02.39 + 0:00.20    0:07.40 + 1:11.35    0:08.04 + 1:15.14
    8.2.2       0:02.14 + 0:00.14    0:08.06 + 0:20.93    0:08.21 + 0:21.92
    8.4.3       0:02.09 + 0:00.16    0:05.90 + 0:18.06    0:05.82 + 0:19.04
    8.6.4       0:02.16 + 0:00.18    0:05.50 + 0:08.76    0:06.96 + 0:09.43
    8.8.2       0:02.21 + 0:00.16    0:05.40 + 0:08.58    0:06.06 + 0:09.79
    HEAD        0:02.24 + 0:00.15    0:05.60 + 28:29.66   0:05.99 + ???
Build times vary between versions, but up to 8.8.2 it is still possible to build this program even with optimizations. Starting with GHC HEAD it becomes practically impossible to build with -O1 or -O2.
Max residency (bytes)

Note that I couldn't use the flags that give a more accurate maximum residency (such as -i0 -A256k -h), because the process takes forever with them.
                -O0                         -O1                              -O2
    7.10.3      61,450,496 + 16,328,992     175,227,848 + 1,043,963,296      178,227,848 + 1,184,121,656
    8.0.2       75,040,232 + 23,291,632     187,169,120 + 1,178,903,625      207,405,512 + 1,274,237,376
    8.2.2       66,634,928 + 12,522,160     155,637,520 + 388,455,400        153,829,288 + 369,651,232
    8.4.3       60,512,912 + 21,139,800     115,509,920 + 357,125,808        104,963,224 + 324,197,224
    8.6.4       78,645,816 + 20,654,504     133,772,968 + 164,819,488        121,994,368 + 169,405,352
    8.8.2       62,489,576 + 14,364,224     136,131,096 + 144,275,584        153,169,016 + 170,175,712
    HEAD        55,104,744 + 13,785,664     148,848,096 + more than 16G      ???
So unlike the original report, I get worse residency in 7.10 and 8.0 than in
newer versions.
And yes, we need to fix GHC HEAD before the next release.
It'd be good if someone else could try to reproduce the numbers. I made two GHC HEAD builds because I couldn't believe these numbers were real, but it's still possible that I've made a mistake in my benchmarking.
discount; a "discount" based on its calling context
Then we compute (size - (discount * unfolding-keeness-factor)) and
see if that is less than unfolding-use-threshold. Its default value
is 1.5.
By itself that makes sense. The unfolding-keeness-factor multiplies
up the discount, and makes GHC more eager to inline.
But the discount is computed like this. Suppose the function f has first argument x, and the call looks like f (K7 a b), where K7 is a data constructor of a type with many constructors. Then suppose that inside f's body we have

    case x of
      K1   -> e1
      K2   -> e2
      ...
      K100 -> e100
Knowing that the argument is K7 would dramatically shrink the code! Suppose the size of e1 is s1, of e2 is s2, and so on. Then we give a discount for x equal to

    discount(x) = (s1 + s2 + ... + s100) - smax

where smax is the size of the largest of the ei.
The idea is that, if we know which constructor x is built with, the expression will shrink to smax at worst, and maybe less. So that's a good discount to apply.
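In code, that discount is just the following (a sketch, not GHC's implementation):

    -- Discount for knowing the scrutinised argument's constructor: the sum of
    -- all alternative sizes minus the size of the largest alternative.
    argDiscount :: [Int] -> Int
    argDiscount sizes = sum sizes - maximum sizes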
ALAS, the reasoning is utterly bogus if you first multiply the discount by 1.5! In this particular example I saw a function body of size 5122, with a discount of 4354 if you know the first argument. That's OK on its own: many case alternatives will be discarded within that 5122-sized body if we know the first argument, but that still leaves a large residual body of size 750 or so.
But if we multiply the discount by 1.5, it becomes way bigger than 5122,
and we inline this vast function. This happens many times. Disaster.
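Spelling out the arithmetic with the numbers above (a sketch of the comparison only, not GHC's actual code; the real computation lives in CoreUnfold):

    -- Numbers from the example above.
    size, discount :: Int
    size     = 5122
    discount = 4354

    keenessFactor :: Double
    keenessFactor = 1.5

    -- Effective size with a sensible comparison: still a big function, well
    -- above any plausible use threshold, so it would not be inlined.
    plainEffectiveSize :: Int
    plainEffectiveSize = size - discount                                       -- 768

    -- Effective size as currently computed: multiplying the discount by 1.5
    -- drives it below zero, i.e. below every threshold, so the huge function
    -- is inlined at each of its many call sites.
    keenedEffectiveSize :: Int
    keenedEffectiveSize = size - round (fromIntegral discount * keenessFactor) -- -1409

    main :: IO ()
    main = print (plainEffectiveSize, keenedEffectiveSize)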
What to do about it
It is clearly wrong to multiply the argument discounts; they are absolute sizes. Better to say

    (size - discount) <= threshold * keeness-factor

But that is no different from setting a different threshold.
My conclusion: let's kill off -funfolding-keeness-factor entirely, and (to maintain something like the status quo) bump up the default unfolding-use-threshold a bit.
Of course that doesn't maintain exactly the status quo;
the calculation is different. So we need to do some experimentation.
I'll admit, I am a bit concerned about how this will affect simplification of existing, hand-optimised code. However, let's see what happens on head.hackage.
The point is that currently we are manifestly doing something ridiculously stupid. We must stop doing that. It is difficult to do so in a way that guarantees not to affect anything sensible. Let's just try and see -- and advertise the change so that users can re-check their inner loops.