Improve build parallelism

Currently in Batch mode (ghc --make -j), the GHC.Driver.Make.parUpsweep compiles modules in parallel, with each module waiting for all of it's dependencies to compile completely before proceeding. Parallelism is constrained by the -j flag, which limits the number of modules that may be compiled simultaneously. When multiple modules could be compiled next, ties are broken by a race to claim a semaphore.

This ticket is about improving the parallelism by making the compilation steps more finely-grained, and unblocking the compilation of next modules early.

For example, given a module A imported by modules B and C. The old way would be:

graph LR;
  A[ModA  Ps Rn Tc Ds....Cg]-->B;
  A-->C;
  B[ModB  Ps Rn Tc Ds....Cg]-->D;
  C[ModC  Ps Rn Tc Ds....Cg]-->D[ModD  Ps Rn Tc Ds....Cg];
  Ps[Ps = Parse] -.- Rn[Rn = Rename] -.-  Tc[Tc = TypeCheck] -.- Ds[Ds = Desugar] -.-Cg[Cg = Code gen] ;

while the new way would be at least as granular as:

graph LR;
%% First
  subgraph typechecking
  A1[ModA  Ps Rn Tc]-- HPT 1 -->B1;
  A1-- HPT 1 -->C1;
  B1[ModB  Ps Rn Tc]-- HPT 1 -->D1;
  C1[ModC  Ps Rn Tc]-- HPT 1 -->D1[ModD  Ps Rn Tc];
  end

%% Second
  subgraph post-typechecking
  A2[ModA  Ds C2C C2S Cg]-- HPT 2 -->B2;
  A2-- HPT 2 -->C2;
  B2[ModB  Ds C2C C2S Cg]-- HPT 2 -->D2;
  C2[ModC  Ds C2C C2S Cg]-- HPT 2 -->D2[ModD  Ds C2C C2S Cg];
  end

%% Interconnects
  A1-->A2;
  B1-->B2;
  C1-->C2;
  D1-->D2;

%% Legend
  Ps[Ps = Parse] -.- Rn[Rn = Rename] -.-  Tc[Tc = TypeCheck] -.- Ds[Ds = Desugar];
  C2C[C2C = Core-to-Core]-.- C2S[C2S = Core-to-Stg]-.- Cg[Cg = StgToCmm, C-- pipeline, and NCG];
  HPT1[HPT 1 = HPT without linkables ] -.- HPT2[HPT 2 = HPT with linkables + Unfoldings + CAFyness info]

Note that the code gen jobs depend on the other codegen jobs because of cross-module optimization. It is important that the CAFfyness information propagate from code generation to dependent modules. This information is currently injected into ModDetails by GHC.Iface.UpdateIdInfos.

Module loop

In case of module loop there is a need to do retypechecking of the entire loop. This requirement needs to be taken care of as part of "ModIface A" edge of the "final_loop" module.

-dynamic-too

This currently causes the backend pipeline (C2C C2S Cg) to run two times sequentially to generate both .o and .dyn_o. There are multiple opportunities for improvement

The C2C and C2S stages need not be run twice (ref #17502)
The next modules' compilation could be unblocked after generating only the .dyn_o (or .o ?).
The .o and .dyn_o could be generated in parallel
For -fno-code TH support there should be no need of generating both .o and .dyn_o. (TODO: figure out what is the preferred or more suitable of these two)

Feasibility testing

This idea has been implemented using some hacks here https://gitlab.haskell.org/obsidiansystems/ghc/-/tree/dn-wip-fine-grain-parsweep-1 . This has been tested to compile the Cabal in about 80% time with > 8 cores.

Implementation Plan

Clean up existing code, and split some of the driver functions

Separate recompilation avoidance code, so that it could be done before beginning typechecking

There are two reasons for doing this, the first is to make code cleaner to understand. The other is that the recompilation avoidance checks could / should be done before obtaining the semaphore. Waiting for the semaphore and then releasing it almost immediately would have much more overheads over simply doing these checks.

This step involves rewriting hscIncrementalFrontend, which currently does the recomp checks and typechecking together. (This plan assumes the checkOldIface as a black box and doesn't plan to change it, though !3204 (closed) could be included)

After this is done there would be three steps in hscIncrementalCompile. (But since the hscIncrementalCompile itself would not longer be needed, these steps could be done directly from runPhase pipeline)
- Recompilation avoidance checks (checkOldIface)
- Typecheck
- Desugar + Simplify
Cleanup compileOne

The compileOne should be a light wrapper API which simply call the runPhase pipeline (after patching the dflags and hsc_env). All other code from compileOne' should be removed. Currently it has almost identical logic to the one present in runPhase.

This cleanup would require
- Add a new PhasePlus BeginHsc BeginHscArgs
  
  where the BeginHscArgs is similar to the args currently passed to hscIncrementalCompile
  
  This would allow invoking the runPhase pipeline with the ModSummary (parsed module)
Cleanup runPhase
- Simplify HscStatus by separating out the src_flavor logic (as already done in !2603)
- Remove HscOut, as it is no longer needed. (It was needed only by compileOne)
Note: There would be a need to add another PhasePlus to begin the pipeline later in the backend codegen. (perhaps BeginCg, depends on --dynamic-too investigations)
--dynamic-too related cleanups

There is a good amount of complexity scattered in the runPipeline related to --dynamic-too handling, which should be pulled outside.

This code under the hood runs the pipeline twice, and produces two interface and object files. But ideally the runPipeline (and arguably the one-shot mode) should only produce a single output. And the logic to product multiple outputs should be part of the driver logic.

In the case of --make it would be necessary to move this logic in parUpsweep_mod, so that it could run the pipeline in parallel, and keep the critical path small by unblocking the next module's compilation early.

For the one-shot mode / GHC APIs, the compileFile / compileOne driver should have this logic.
Split hscGenHardCode

This split will be decided based on the requirements of --dynamic-too. The actual code split should be straightforward to do.

Modify `--make` to take advantage of the split phases/driver functions

Refactor out the recompilation avoidance checks in upsweep_mod

These checks are used during the interpreted mode/ghci and involve in-memory data StableModules.

These checks need to be done before the Iface checks, and should be done before obtaining the semaphore.
Remove upsweep, and use parUpsweep for single thread compilation as well.

Having two pieces of almost identical code will lead to maintenance burden and even bugs. In fact there is already a possible bug, the upsweep does hscAddSptEntries for the interpretted backend, but parUpsweep_mod does not.

In order to use the same code for the single thread mode, the parUpsweep should have identical functionality and performance as upsweep. (Though it might be nice if it could report errors from all modules, ref: #17890)

It would be great if the parUpsweep could be written in a way to have the separation in the boilerplate needed for thread management and other logic. But perhaps simply using traverse instead of forkIO should be sufficient in achieving the same behavior and performance. (In my testing parUpsweep has identical time performance as upsweep)

There is a scenario of --keep-going which will require an extra logic to maintain the same behavior.
parUpsweep fixes
- For each module parUpsweep_one will be forked, just like its currently being done. So there will be minimal changes to its code.
- parUpsweep_one will now execute the following steps directly.
  - Recompilation avoidance checks (StableModules + Iface)
  - Tc + reTypecheckLoop logic
  - Ds, C2C, C2S
  - Cg (will use runPhase pipeline, via BeginCg)
- The use of global HscEnv MVar would be removed, instead the data required by the threads from its dependencies would be passed explicitly via the MVars.
  
  This has a number of advantages over the current global MVar approach
  - Easier to understand
  - Better performance
  - Deterministic and testable
- TH splice support
  
  TODO: This would possibly require us to stop the fine-grained parallelism.
  
  There is no plan to add the dynamic adjustments to the dependency graph/driver based on the TH usage in modules.

At the end of all this there would be two mostly independent drivers, the runPhase (used by one-shot mode and GHC APIs calling compileOne) and the parUpsweep (used by --make and ghci).

The parUpsweep_one would end up doing a lot more of the steps itself, so there would be code duplication. On the bright side it would make its workings easier to understand.

Testing

TODO

Out of scope

Scheduler for fine grained driver The parallel batch-mode driver could also include a scheduler, which has the necessary heuristics to do the typechecking processes first (ie give priority to typechecking tasks over desugar + codegen).

Edited Apr 07, 2021 by Divam Narula

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message