Hadrian: Building Stage2 GHC is a Huge Cache Miss
The problem
The majority of stage2 build rules depend on the stage1 ghc binary. If the stage1 ghc binary changes in any way (a common use case) then we can expect stage2 to have disappointingly low utilization of the cache.
Details
A fair amount of effort is being spent to enable cached builds in Hadrian. The underlying assumption is that this will greatly speed up build times. Running a clean cached build (to populate the cache), then deleting the build dir and running a build again gives about a 10x speedup!
$ ./hadrian/build.sh --share # ~30 min
$ rm -r _build
$ ./hadrian/build.sh --share # ~ 3 min
Unfortunately this very positive result was obtained under optimal conditions: no input files have changed, so we expect a minimal number of cache misses. A more common case is that the user has made some small change to ghc. As the majority of stage2 depends on the stage1 ghc binary, we can expect stage2 to have very low utilization of the cache.
- TODO Once cached builds are working somewhat reliably, confirm and quantify the performance impact of this issue.
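The cache-miss argument above can be illustrated with a toy model. This is only a sketch under a loud assumption: it treats a cache key as a checksum over the compiler binary plus the source file, which is not how Hadrian/Shake actually computes its keys. The function name `cache_key` and the file names are hypothetical.

```shell
#!/bin/sh
# Toy model only (an assumption, not Hadrian's real scheme): the cache key
# of a build rule is a checksum over every input the rule depends on.
# For a stage2 module those inputs include the stage1 ghc binary.

cache_key() {
    compiler="$1"   # path to the compiler binary used by the rule
    source="$2"     # path to the source file being compiled
    # Checksum the concatenated bytes of both inputs; keep only the CRC field.
    cat "$compiler" "$source" | cksum | cut -d' ' -f1
}
```

With this model, rebuilding an unchanged module with an unchanged compiler yields the same key (a cache hit), while any change to the compiler bytes yields a new key for every module compiled with it, even though no source file changed.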
Note that GHC has a recompilation checker that is currently turned on in hadrian (not turned off with ghc -fforce-recomp). This could alleviate some of the performance degradation when the _build directory is already populated (not the case with CI builds).
Possible Solutions
It is in fact correct that stage2 depends on stage1 ghc: building a module with a changed stage1 ghc may result in a different output. We can perhaps mitigate the problem by using --freeze1, which freezes stage1, thereby eliminating this issue. With some luck this may work out of the box.
Effect on CI
Unfortunately CI will start with an empty _build dir, so it will have no stage1 to freeze. A workaround would be to assume that freezing stage1 is safe in the majority of cases, then make CI jobs do this:
- checkout the last commit from the previous day
- build stage1
- checkout the commit for this CI job
- build with --freeze1
- run the rest of this CI job.
This means jobs on any given day will build the same stage1, giving good cache utilization and sharing for stage1. Then with --freeze1 individual jobs will hopefully still get good cache utilization in stage2. We can still run nightly jobs without --freeze1 (or even a subset of jobs in the normal CI pipeline).
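
The CI steps above could be sketched roughly as follows. This is a minimal sketch, not a tested CI configuration: the function name `plan_ci_job` is hypothetical, the stage1 build target is left as a placeholder because this page does not name it, and combining --share with --freeze1 in one invocation is an assumption. The function only prints the commands so the plan can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch of the proposed CI steps. Prints the plan; it does not run it.
#
# Assumptions (not from the original page):
#   - the caller resolves "the last commit from the previous day" itself
#   - the third argument is whatever Hadrian target builds the stage1 ghc;
#     it defaults to an obvious placeholder that must be replaced
#   - --share and --freeze1 can be passed together

plan_ci_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: plan_ci_job BASE_COMMIT JOB_COMMIT [STAGE1_TARGET]" >&2
        return 1
    fi
    base_commit="$1"   # last commit from the previous day
    job_commit="$2"    # commit under test in this CI job
    stage1_target="${3:-STAGE1_TARGET_PLACEHOLDER}"

    # 1. Check out the shared base commit and build stage1 from it.
    echo "git checkout $base_commit"
    echo "./hadrian/build.sh --share $stage1_target"
    # 2. Check out the commit under test and build with stage1 frozen.
    echo "git checkout $job_commit"
    echo "./hadrian/build.sh --share --freeze1"
}
```

The base commit could be resolved with something like `git rev-list -1 --before=<date> <branch>`, with the date and branch chosen by the CI configuration. Because every job on a given day passes the same base commit, the first two commands have identical inputs across jobs, which is what lets them share cache entries for stage1.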