Ability to see time exact compilation time of modules
Long compilation times are a challenge for large Haskell projects. After tricks like giving GHC more memory are exhausted, users have a few levers within their code to speed up compilation:
- Reducing the time it takes to compile a module (e.g. removing unnecessary
derivingstatements, deleting unused code, etc.)
- Improving parallelism, which currently is done at the module level.
- Improving caching via incremental builds (mostly unrelated to this feature request)
However, users don't have great insight into what modules are slow to compile, and where compilation is bottlenecked on fewer than the available CPU cores, leading to fruitless StackOverflow questions like this one: https://stackoverflow.com/questions/47444097/is-there-a-way-to-profile-compilation-time-per-module-with-ghc
As far as I can tell, the best information available right now is the output of
Parser [Model.PurposeCode]: alloc=2255040 time=10.326 Simplify [expr]: alloc=10824 time=0.116 ...
While not a friendly UX, you may be able to get the total time by manually adding up the time of each section (?). However, you still would not be able to tell if e.g. two modules were compiled in parallel, because you don't know what times they started at.
I think an easy solution for this is to print the total time a module took to compile, as well as the UTC time when compilation began and ended. I'm not super familiar with GHC's debugging options, but from the documentation
-ddump-timings would be a natural place for this information. This would match existing
-ddump output like
-ddump-simpl printing the exact UTC times of each simplification step.
This would enable two things:
- The total time per module would make it obvious which modules were slow to compile. Once the information was in
-ddump-timings, external tools (or just a small shell script) could parse that info and sort results by total time.
- The timestamp at the beginning and end of compilation could be used to know what modules are compiling at what time. This information would help determine when GHC is bottlenecked on a single module.
To elaborate on the timestamp use-case, imagine we're compiling these modules, with E depending on D, and D on A/B/C:
E | D / | \ A B C
If I see that modules A, B and C all run from 0–2s, I can tell that they are being compiled in parallel (yay!). If I then see that module D runs at 2–5s, followed by E at 5–6s, I can get a rough idea that I could speed up compilation by e.g. breaking up module D into parts, or trying to extract some of E to compile in parallel with D. One could use the timestamps to create a visualizations to aid in this:
A====|D=============|E==== B==== C====
Note that the static dependency graph information isn't sufficient to know what to improve, but is helpful. The compilation time could have easily been this instead:
A==== |D====|E==== B==== C==========
I imagine there are other ways this information could be exposed to users. For example, a profiled version of GHC might emit events that say when compilation of a module began and ended. This would probably be reasonable, but simple timestamps added to GHC's existing debug output seem like an easy, accessible solution.
I'm also not sure if CPU time (what GHC currently gets in its
withTiming function as far as I can tell) or UTC time (what is output in the -ddump-simpl output) is preferable.