<Peter12> Hi! One question. I tested 8.6.5. on a 5Ghz machine. runghc for 'hello world' takes 0.1 seconds. Anyway to speed that up?
<Peter12> ghci is slightly faster, hint is slower, ghc -O0 and running the program is significantly slower.
<Peter12> This is holding back my research, so I would be very happy for pointers.
<Peter12> I guess ghc is single threadded, so runghc won't get faster with more cores.
<Peter12> Just as a reference, while runghc takes 0.1 seconds, hugs runs in 0.02 seconds for a hello world program.
Another thing to consider: the above measurements were taken after I had deleted the default package environment file populated by cabal-install. Before I had done that the runtime was nearly double what I report above.
The RPATH size reduction idea was implemented in Hadrian in #11587 (closed). The number of failed openat calls was reduced dramatically as a result. Sadly, this made little difference to the overall runtime.
I ended up patching GHC to instrument all calls to external programs in compiler/main/SysTools/Task.hs, in addition to what Ben did in !783 (merged), to see how much time is spent calling the C compiler, the assembler (as), the linker, and so on. A rough sketch of that instrumentation is shown below, followed by what I saw on my machine.
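The sketch below only illustrates the idea, assuming a small helper (the name withToolTiming and the "systool:..." label are mine, not what the actual patch uses): it wraps an IO action in "GHC:started:"/"GHC:finished:" eventlog markers, which is what the --start/--stop prefixes passed to ghc-events-analyze further down match on.

```haskell
import Control.Exception (bracket_)
import Debug.Trace (traceEventIO)

-- Emit "GHC:started:<label>" / "GHC:finished:<label>" user events into the
-- eventlog around an IO action, so ghc-events-analyze can attribute the
-- elapsed time to that label.
withToolTiming :: String -> IO a -> IO a
withToolTiming label =
  bracket_ (traceEventIO ("GHC:started:" ++ label))
           (traceEventIO ("GHC:finished:" ++ label))

-- Stand-in for the code that actually shells out to the linker.
main :: IO ()
main = withToolTiming "systool:linker" (putStrLn "calling the linker ...")
```

Using bracket_ ensures the "finished" marker is emitted even if the wrapped call throws; the events only end up in the eventlog when the program is run with +RTS -l, as in the invocation below.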
$ time _build/stage1/bin/ghc -O0 -ddump-timings hi.hs +RTS -l -s
*** Chasing dependencies:
Chasing dependencies: alloc=207992 time=0.465
[1 of 1] Compiling Main             ( hi.hs, hi.o )
*** Parser [Main]:
Parser [Main]: alloc=59768 time=0.108
*** Renamer/typechecker [Main]:
Renamer/typechecker [Main]: alloc=14212440 time=33.033
*** Desugar [Main]:
Desugar [Main]: alloc=124584 time=2.494
*** Simplifier [Main]:
Simplifier [Main]: alloc=142424 time=0.212
*** CoreTidy [Main]:
CoreTidy [Main]: alloc=1774264 time=2.719
*** CorePrep [Main]:
CorePrep [Main]: alloc=10256 time=0.030
*** CodeGen [Main]:
CodeGen [Main]: alloc=1365856 time=1.635
Linking hi ...
      55,739,352 bytes allocated in the heap
      35,411,920 bytes copied during GC
       5,975,472 bytes maximum residency (5 sample(s))
         123,472 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        40 colls,     0 par    0.021s   0.021s     0.0005s    0.0024s
  Gen  1         5 colls,     0 par    0.060s   0.060s     0.0120s    0.0170s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.030s  (  0.472s elapsed)
  GC      time    0.082s  (  0.081s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time    0.112s  (  0.560s elapsed)

  Alloc rate    1,886,295,850 bytes per MUT second

  Productivity  26.4% of total user, 84.3% of total elapsed

real    0m0.585s
user    0m0.466s
sys     0m0.118s

$ ghc-events-analyze --timed --timed-txt --totals --start "GHC:started:" --stop "GHC:finished:" ghc.eventlog
Generated ghc.totals.txt using default script
Generated ghc.timed.svg using default script
Generated ghc.timed.txt using default script
And here's the SVG file that puts it all in context, graphically (you'll quite likely want to click on the image and move around a bit, as it's pretty large):
So out of the ~560ms of elapsed time, ~305ms were spent calling the linker, ~108ms calling the C compiler, ~20ms calling as, and around 10ms in the compiler phases themselves. That accounts for roughly 446ms of the ~560ms it took to compile the hello world program.
An "additional" 81 milliseconds (there's very, very little overlap between GC and the user labels) are reported as spent in GC, which gets us to roughly 527ms out of 560ms. Interestingly, most of the GC time happens at the beginning (see the big red bars at the top of the SVG output).
I also see a chunk of time where none of our user labels light up at all: it sits right where the seemingly unending sequence of garbage collections comes to an end, and right before we call as.
@bgamari Is it worth hunting down where exactly that missing time is spent?
Also, do you want a merge request for the additional instrumentation of the C compiler, the linker, etc?
I tried a few different allocation area sizes (set through the RTS's -A flag), collecting some numbers for each and generating some plots automatically from that; a sketch of the sweep is included after the table. Overview:
| Allocation area (MB) | Elapsed (s) | GC (s) | MUT (s) | Productivity (%) | Linker (s) | CC (s) | as (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.520 | 0.066 | 0.447 | 85.9 | 0.311 | 0.090 | 0.016 |
| 2 | 0.540 | 0.078 | 0.461 | 85.4 | 0.309 | 0.093 | 0.018 |
| 4 | 0.530 | 0.060 | 0.464 | 87.5 | 0.329 | 0.084 | 0.016 |
| 8 | 0.521 | 0.050 | 0.460 | 88.4 | 0.309 | 0.091 | 0.016 |
| 16 | 0.491 | 0.029 | 0.460 | 93.9 | 0.306 | 0.084 | 0.020 |
| 32 | 0.481 | 0.020 | 0.458 | 95.3 | 0.309 | 0.085 | 0.016 |
| 64 | 0.481 | 0.008 | 0.471 | 97.8 | 0.313 | 0.085 | 0.017 |
| 128 | 0.492 | 0.009 | 0.478 | 97.1 | 0.314 | 0.085 | 0.016 |
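For reference, here is roughly how such a sweep can be automated. This is only a sketch under my own assumptions (the hi.hs file name, the stage1 GHC path, and -fforce-recomp to defeat recompilation avoidance), not the exact script I used, and it leaves out parsing the -s output and plotting:

```haskell
import System.Process (callProcess)

-- Allocation area (nursery) sizes to try, in megabytes.
allocSizesMB :: [Int]
allocSizesMB = [1, 2, 4, 8, 16, 32, 64, 128]

-- Recompile the hello-world module once per size, asking GHC's RTS for an
-- eventlog (-l) and GC statistics (-s) so the numbers above can be extracted.
compileWith :: Int -> IO ()
compileWith mb =
  callProcess "_build/stage1/bin/ghc"
    [ "-O0", "-fforce-recomp", "hi.hs"
    , "+RTS", "-A" ++ show mb ++ "m", "-l", "-s", "-RTS"
    ]

main :: IO ()
main = mapM_ compileWith allocSizesMB
```

In practice each run overwrites ghc.eventlog and the -s summary goes to stderr, so both need to be stashed away between runs before handing them to ghc-events-analyze.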
As one can see from this table and the diagrams/plots below, with small allocation areas we spend "a lot" of time doing GC very early in the life of this GHC command. Even for a program as small as the one I'm building here (main = putStrLn "hello"), we're bound to collect some garbage during typechecking. Very large allocation areas help a lot, but they're not realistic defaults, and even then they don't win us all that much (a few dozen milliseconds).
We can still see two chunks of time for which we don't have any label: right at the beginning (loading up GHC and whatnot, I suppose) and right after code generation, before calling as. They're not ridiculously small, but they're almost negligible compared to the time it takes for a linker command to complete.
More details
SVG files showing when the different events took place:
Plots of a few measures as functions of the allocation area size:
Indeed, with an allocation area >= 64MB we don't spend more than 10ms doing GC in total, so there's presumably room for most of what we need to allocate. A good chunk of the rest of the time is then spent waiting on the SysTools calls. Regarding the two "mystery chunks": the first one is probably just starting up GHC, and the second one might be when we emit the assembly and write it to disk, before it is passed down the pipeline?