<Peter12> Hi! One question. I tested 8.6.5. on a 5Ghz machine. runghc for 'hello world' takes 0.1 seconds. Anyway to speed that up?
<Peter12> ghci is slightly faster, hint is slower, ghc -O0 and running the program is significantly slower.
<Peter12> This is holding back my research, so I would be very happy for pointers.
<Peter12> I guess ghc is single threadded, so runghc won't get faster with more cores.
<Peter12> Just as a reference, while runghc takes 0.1 seconds, hugs runs in 0.02 seconds for a hello world program.
Another thing to consider: the above measurements were taken after I had deleted the default package environment file populated by cabal-install. Before I had done that the runtime was nearly double what I report above.
The RPATH size reduction idea was implemented in Hadrian in #11587 (closed). The number of failed openat calls was reduced dramatically as a result. Sadly, this made little difference to the overall runtime.
I ended up patching GHC to instrument all calls to external programs in compiler/main/SysTools/Task.hs, in addition to what Ben did in !783 (merged), to see how much time is spent calling the C compiler, the assembler (as), the linker, and so on. A rough sketch of that instrumentation is shown below, followed by what I saw on my machine.
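The sketch below only illustrates the idea, assuming a small helper (the name withToolTiming and the "systool:..." label are mine, not what the actual patch uses): it wraps an IO action in "GHC:started:"/"GHC:finished:" eventlog markers, which is what the --start/--stop prefixes passed to ghc-events-analyze further down match on.

```haskell
import Control.Exception (bracket_)
import Debug.Trace (traceEventIO)

-- Emit "GHC:started:<label>" / "GHC:finished:<label>" user events into the
-- eventlog around an IO action, so ghc-events-analyze can attribute the
-- elapsed time to that label.
withToolTiming :: String -> IO a -> IO a
withToolTiming label =
  bracket_ (traceEventIO ("GHC:started:" ++ label))
           (traceEventIO ("GHC:finished:" ++ label))

-- Stand-in for the code that actually shells out to the linker.
main :: IO ()
main = withToolTiming "systool:linker" (putStrLn "calling the linker ...")
```

Using bracket_ ensures the "finished" marker is emitted even if the wrapped call throws; the events only end up in the eventlog when the program is run with +RTS -l, as in the invocation below.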
$ time _build/stage1/bin/ghc -O0 -ddump-timings hi.hs +RTS -l -s
*** Chasing dependencies:
Chasing dependencies: alloc=207992 time=0.465
[1 of 1] Compiling Main             ( hi.hs, hi.o )
*** Parser [Main]:
Parser [Main]: alloc=59768 time=0.108
*** Renamer/typechecker [Main]:
Renamer/typechecker [Main]: alloc=14212440 time=33.033
*** Desugar [Main]:
Desugar [Main]: alloc=124584 time=2.494
*** Simplifier [Main]:
Simplifier [Main]: alloc=142424 time=0.212
*** CoreTidy [Main]:
CoreTidy [Main]: alloc=1774264 time=2.719
*** CorePrep [Main]:
CorePrep [Main]: alloc=10256 time=0.030
*** CodeGen [Main]:
CodeGen [Main]: alloc=1365856 time=1.635
Linking hi ...
      55,739,352 bytes allocated in the heap
      35,411,920 bytes copied during GC
       5,975,472 bytes maximum residency (5 sample(s))
         123,472 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        40 colls,     0 par    0.021s   0.021s     0.0005s    0.0024s
  Gen  1         5 colls,     0 par    0.060s   0.060s     0.0120s    0.0170s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.030s  (  0.472s elapsed)
  GC      time    0.082s  (  0.081s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time    0.112s  (  0.560s elapsed)

  Alloc rate    1,886,295,850 bytes per MUT second

  Productivity  26.4% of total user, 84.3% of total elapsed

real    0m0.585s
user    0m0.466s
sys     0m0.118s

$ ghc-events-analyze --timed --timed-txt --totals --start "GHC:started:" --stop "GHC:finished:" ghc.eventlog
Generated ghc.totals.txt using default script
Generated ghc.timed.svg using default script
Generated ghc.timed.txt using default script
And here's the SVG file that puts it all in context, graphically (you'll quite likely want to click on the image and move around a bit, as it's pretty large):
So out of the ~560ms of elapsed time, ~305ms were spent calling the linker, ~108ms calling the C compiler, ~20ms calling as, and around 10ms in the compiler phases themselves. That accounts for roughly 446ms of the ~560ms it took to compile the hello world program.
An "additional" 81 milliseconds (there's very, very little overlap between GC and the user labels) are reported as spent in GC, which gets us to roughly 527ms out of 560ms. Interestingly, most of the GC time happens at the beginning (see the big red bars at the top of the SVG output).
I also see a chunk of time where none of our user labels light up at all: it sits right where the seemingly unending sequence of garbage collections comes to an end, and right before we call as.
@bgamari Is it worth hunting down where exactly that missing time is spent?
Also, do you want a merge request for the additional instrumentation of the C compiler, the linker, etc?
I tried a few different allocation area sizes (set through the RTS's -A flag), collecting some numbers for each and generating some plots automatically from that; a sketch of the sweep is included after the table. Overview:
| Allocation area (MB) | Elapsed (s) | GC (s) | MUT (s) | Productivity (%) | Linker (s) | CC (s) | as (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.520 | 0.066 | 0.447 | 85.9 | 0.311 | 0.090 | 0.016 |
| 2 | 0.540 | 0.078 | 0.461 | 85.4 | 0.309 | 0.093 | 0.018 |
| 4 | 0.530 | 0.060 | 0.464 | 87.5 | 0.329 | 0.084 | 0.016 |
| 8 | 0.521 | 0.050 | 0.460 | 88.4 | 0.309 | 0.091 | 0.016 |
| 16 | 0.491 | 0.029 | 0.460 | 93.9 | 0.306 | 0.084 | 0.020 |
| 32 | 0.481 | 0.020 | 0.458 | 95.3 | 0.309 | 0.085 | 0.016 |
| 64 | 0.481 | 0.008 | 0.471 | 97.8 | 0.313 | 0.085 | 0.017 |
| 128 | 0.492 | 0.009 | 0.478 | 97.1 | 0.314 | 0.085 | 0.016 |
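For reference, here is roughly how such a sweep can be automated. This is only a sketch under my own assumptions (the hi.hs file name, the stage1 GHC path, and -fforce-recomp to defeat recompilation avoidance), not the exact script I used, and it leaves out parsing the -s output and plotting:

```haskell
import System.Process (callProcess)

-- Allocation area (nursery) sizes to try, in megabytes.
allocSizesMB :: [Int]
allocSizesMB = [1, 2, 4, 8, 16, 32, 64, 128]

-- Recompile the hello-world module once per size, asking GHC's RTS for an
-- eventlog (-l) and GC statistics (-s) so the numbers above can be extracted.
compileWith :: Int -> IO ()
compileWith mb =
  callProcess "_build/stage1/bin/ghc"
    [ "-O0", "-fforce-recomp", "hi.hs"
    , "+RTS", "-A" ++ show mb ++ "m", "-l", "-s", "-RTS"
    ]

main :: IO ()
main = mapM_ compileWith allocSizesMB
```

In practice each run overwrites ghc.eventlog and the -s summary goes to stderr, so both need to be stashed away between runs before handing them to ghc-events-analyze.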
As one can see from this table and the diagrams/plots below, with small allocation areas we spend "a lot" of time doing GC very early in the life of this GHC command. Even for a program as small as the one I'm building here (main = putStrLn "hello"), we're bound to collect some garbage during typechecking. Very large allocation areas help a lot, but they're not realistic defaults, and even then they don't win us all that much (a few dozen milliseconds).
We can still see two chunks of time for which we don't have any label: right at the beginning (loading up GHC and whatnot, I suppose) and right after code generation, before calling as. They're not ridiculously small, but they're almost negligible compared to the time it takes for a linker command to complete.
More details
SVG files showing when the different events took place:
Plots of a few measures as functions of the allocation area size:
Indeed, with an allocation area >= 64MB we don't spend more than 10ms doing GC in total, so there's presumably room for most of what we need to allocate. A good chunk of the rest of the time is then spent waiting on the SysTools calls. Regarding the two "mystery chunks": the first one is probably just starting up GHC, and the second one might be when we emit the assembly and write it to disk, before it is passed down the pipeline?