Improve build parallelism
Currently in Batch mode (ghc --make -j
), the GHC.Driver.Make.parUpsweep
compiles modules in parallel, with each module waiting for all of it's dependencies to compile completely before proceeding.
Parallelism is constrained by the -j flag, which limits the number of modules that may be compiled simultaneously.
When multiple modules could be compiled next, ties are broken by a race to claim a semaphore.
This ticket is about improving the parallelism by making the compilation steps more finely-grained, and unblocking the compilation of next modules early.
For example, given a module A
imported by modules B
and C
. The old way would be:
graph LR;
A[ModA Ps Rn Tc Ds....Cg]-->B;
A-->C;
B[ModB Ps Rn Tc Ds....Cg]-->D;
C[ModC Ps Rn Tc Ds....Cg]-->D[ModD Ps Rn Tc Ds....Cg];
Ps[Ps = Parse] -.- Rn[Rn = Rename] -.- Tc[Tc = TypeCheck] -.- Ds[Ds = Desugar] -.-Cg[Cg = Code gen] ;
while the new way would be at least as granular as:
graph LR;
%% First
subgraph typechecking
A1[ModA Ps Rn Tc]-- HPT 1 -->B1;
A1-- HPT 1 -->C1;
B1[ModB Ps Rn Tc]-- HPT 1 -->D1;
C1[ModC Ps Rn Tc]-- HPT 1 -->D1[ModD Ps Rn Tc];
end
%% Second
subgraph post-typechecking
A2[ModA Ds C2C C2S Cg]-- HPT 2 -->B2;
A2-- HPT 2 -->C2;
B2[ModB Ds C2C C2S Cg]-- HPT 2 -->D2;
C2[ModC Ds C2C C2S Cg]-- HPT 2 -->D2[ModD Ds C2C C2S Cg];
end
%% Interconnects
A1-->A2;
B1-->B2;
C1-->C2;
D1-->D2;
%% Legend
Ps[Ps = Parse] -.- Rn[Rn = Rename] -.- Tc[Tc = TypeCheck] -.- Ds[Ds = Desugar];
C2C[C2C = Core-to-Core]-.- C2S[C2S = Core-to-Stg]-.- Cg[Cg = StgToCmm, C-- pipeline, and NCG];
HPT1[HPT 1 = HPT without linkables ] -.- HPT2[HPT 2 = HPT with linkables + Unfoldings + CAFyness info]
Note that the code gen jobs depend on the other codegen jobs because of cross-module optimization. It is important that the CAFfyness information propagate from code generation to dependent modules. This information is currently injected into ModDetails by GHC.Iface.UpdateIdInfos.
Module loop
In case of module loop there is a need to do retypechecking of the entire loop. This requirement needs to be taken care of as part of "ModIface A" edge of the "final_loop" module.
-dynamic-too
This currently causes the backend pipeline (C2C C2S Cg) to run two times sequentially to generate both .o
and .dyn_o
. There are multiple opportunities for improvement
- The C2C and C2S stages need not be run twice (ref #17502)
- The next modules' compilation could be unblocked after generating only the
.dyn_o
(or.o
?). - The
.o
and.dyn_o
could be generated in parallel - For
-fno-code
TH support there should be no need of generating both.o
and.dyn_o
. (TODO: figure out what is the preferred or more suitable of these two)
Feasibility testing
This idea has been implemented using some hacks here https://gitlab.haskell.org/obsidiansystems/ghc/-/tree/dn-wip-fine-grain-parsweep-1 . This has been tested to compile the Cabal in about 80% time with > 8 cores.
Implementation Plan
Clean up existing code, and split some of the driver functions
-
Separate recompilation avoidance code, so that it could be done before beginning typechecking
There are two reasons for doing this, the first is to make code cleaner to understand. The other is that the recompilation avoidance checks could / should be done before obtaining the semaphore. Waiting for the semaphore and then releasing it almost immediately would have much more overheads over simply doing these checks.
This step involves rewriting
hscIncrementalFrontend
, which currently does the recomp checks and typechecking together. (This plan assumes thecheckOldIface
as a black box and doesn't plan to change it, though !3204 (closed) could be included)After this is done there would be three steps in
hscIncrementalCompile
. (But since thehscIncrementalCompile
itself would not longer be needed, these steps could be done directly from runPhase pipeline)- Recompilation avoidance checks (checkOldIface)
- Typecheck
- Desugar + Simplify
-
Cleanup compileOne
The
compileOne
should be a light wrapper API which simply call therunPhase
pipeline (after patching the dflags and hsc_env). All other code fromcompileOne'
should be removed. Currently it has almost identical logic to the one present inrunPhase
.This cleanup would require
-
Add a new
PhasePlus
BeginHsc BeginHscArgs
where the
BeginHscArgs
is similar to the args currently passed tohscIncrementalCompile
This would allow invoking the
runPhase
pipeline with theModSummary
(parsed module)
-
-
Cleanup
runPhase
-
Simplify
HscStatus
by separating out thesrc_flavor
logic (as already done in !2603) -
Remove
HscOut
, as it is no longer needed. (It was needed only bycompileOne
)
Note: There would be a need to add another
PhasePlus
to begin the pipeline later in the backend codegen. (perhapsBeginCg
, depends on --dynamic-too investigations) -
-
--dynamic-too related cleanups
There is a good amount of complexity scattered in the
runPipeline
related to --dynamic-too handling, which should be pulled outside.This code under the hood runs the pipeline twice, and produces two interface and object files. But ideally the
runPipeline
(and arguably the one-shot mode) should only produce a single output. And the logic to product multiple outputs should be part of the driver logic.In the case of --make it would be necessary to move this logic in
parUpsweep_mod
, so that it could run the pipeline in parallel, and keep the critical path small by unblocking the next module's compilation early.For the one-shot mode / GHC APIs, the
compileFile
/compileOne
driver should have this logic. -
Split
hscGenHardCode
This split will be decided based on the requirements of --dynamic-too. The actual code split should be straightforward to do.
--make
to take advantage of the split phases/driver functions
Modify -
Refactor out the recompilation avoidance checks in
upsweep_mod
These checks are used during the interpreted mode/ghci and involve in-memory data
StableModules
.These checks need to be done before the Iface checks, and should be done before obtaining the semaphore.
-
Remove
upsweep
, and useparUpsweep
for single thread compilation as well.Having two pieces of almost identical code will lead to maintenance burden and even bugs. In fact there is already a possible bug, the
upsweep
doeshscAddSptEntries
for the interpretted backend, butparUpsweep_mod
does not.In order to use the same code for the single thread mode, the
parUpsweep
should have identical functionality and performance asupsweep
. (Though it might be nice if it could report errors from all modules, ref: #17890)It would be great if the
parUpsweep
could be written in a way to have the separation in the boilerplate needed for thread management and other logic. But perhaps simply usingtraverse
instead offorkIO
should be sufficient in achieving the same behavior and performance. (In my testingparUpsweep
has identical time performance asupsweep
)There is a scenario of
--keep-going
which will require an extra logic to maintain the same behavior. -
parUpsweep
fixes-
For each module
parUpsweep_one
will be forked, just like its currently being done. So there will be minimal changes to its code. -
parUpsweep_one
will now execute the following steps directly.- Recompilation avoidance checks (StableModules + Iface)
- Tc + reTypecheckLoop logic
- Ds, C2C, C2S
- Cg (will use runPhase pipeline, via
BeginCg
)
-
The use of global
HscEnv
MVar would be removed, instead the data required by the threads from its dependencies would be passed explicitly via the MVars.This has a number of advantages over the current global MVar approach
- Easier to understand
- Better performance
- Deterministic and testable
-
TH splice support
TODO: This would possibly require us to stop the fine-grained parallelism.
There is no plan to add the dynamic adjustments to the dependency graph/driver based on the TH usage in modules.
-
At the end of all this there would be two mostly independent drivers, the runPhase
(used by one-shot mode and GHC APIs calling compileOne
) and the parUpsweep
(used by --make and ghci).
The parUpsweep_one
would end up doing a lot more of the steps itself, so there would be code duplication.
On the bright side it would make its workings easier to understand.
Testing
TODO
Out of scope
- Scheduler for fine grained driver The parallel batch-mode driver could also include a scheduler, which has the necessary heuristics to do the typechecking processes first (ie give priority to typechecking tasks over desugar + codegen).