Pseudo-SSA in NCG
I've implemented a (pseudo)SSA construction and destruction in NCG, motivated by my findings for #19274
Ticket #16243 mentions "live range splitting" as a potential improvement for -fregs-graph. LR splitting will break up existing def-use-chains by inserting SPILL/LOAD instructions. In order for the allocator to assign them different hardregs, "renumbering" must be performed to find "webs" (intersecting def-use chains) and assign them a unique vreg name.
While looking at the program from #8048, I noticed that GHC produces disjoint webs using the same virtual register from the get go. This constrains them to be allocated to the same hardreg without any need and creates false interferences.
%vI_nKI is used in the marked nodes of the cfg:
cHO: # ... movq %vI_nKI,80(%rbp) # Last use in block addq $-64,%rbp # ... cJF: # ... movq 144(%rbp),%vI_nKI # ...
So discovering and renaming webs would be benefitial for register allocation even without LR splitting (and is a prerequisite of it).
The classical approach would be performing a reaching definition dataflow analysis to discover du-chains and perform intersection tests to find intersecting du-chains. That is both intensive in computation time and memory, so the literature generally favours SSA transformation, as SSA names naturally represent du-chains. There is also a rich corpus of literature describing SSA based optimizations and transformations.
I want to avoid introducing yet another intermediate representation and as a pragmatic approach re-use as much of NCG as I can, while introducing SSA. Therefore, I transformed the assembly directly, renaming vregs in the instructions and by representing Phi nodes not as instructions, but alongside basic blocks. There are some issues with assembly, see the next section.
For starters, transforming the code to SSA (construction) and immediately destruct it (out-of-ssa transformation) will split up the webs. Further optimizations can be added in the future.
Currently I'm using the existing liveness analysis to produce pruned-SSA, which is expensive but we want to get rid of all unproductive Phi-nodes before destruction anyway. See "Improvements" for alternatives.
Without any optimizations on the SSA form after construction, the code is still guaranteed to be in "Conventional SSA" (CSSA), so one can use the most simply destruction scheme, namely using union-find on the Phi-nodes to simply rename all variables in the same web to the same name. In the future, more elaborate translation schemes might be necessary, which insert copies.
This relies on the CFG, which is currently only kept up-to-date for x86-64, so this is the only platform currently supported.
Problems with Machine Instructions
The defining property of SSA is, that each assignment of a value creates a new name and that each definition of a name dominates all its uses.
Modifying and 2-Address-Instructions present a problem here. Strictly speaking, I would have to split them up and consider register constraints in destruction. E.g.:
inc %vI_v1 # Is equal to v1_1 = v1_0 + 1 # Or additon add $42, vI_v2 # Is v2_1 = v2_0 + 42
Currently I'm not considering modifications definitions in my implementation. For splitting up webs, that produces correct code, since src/dst registers have to be the same anyway.
It will be necessary to consider the impact on other optimizations though. Without having checked it, I think that things like constant propagation and copy folding should be OK, but anything involving code motion probably not. This is just an instinct though.
Other ISAs may present additional challenges, e.g., on ARM all instructions can be conditional. So is
MOVEQ V1, V2 a definition of V2 or not?
So for additional platforms and optimization, more work will be required. Maybe papers like [LeungG99] and [Dinechin14] could be relevant.
The first prototype is a "first make it work, then make it fast" approach. It uses the current liveness analysis and then build pruned-SSA with the classic algorithms by Cytron et al. (mostly based on [CooperT12] and [SSABook]). Some possible improvements:
- Simply profiling and improving the implementation
- Build semi-pruned SSA and perform liveness analysis on the SSA form
- Iterative computation of DF+ using DJ-Graphs - I think there is quite some literature about this, like from Sreedhar et al. But profiling my prototype actually shows
renamveVarsto be expensive and not this step...
- Using newer/different algorithms all together, e.g., [BraunBHLMZ13]
I think there are also some ways to speed up register allocation using SSA, e.g. building the interference graph. Some more interesting things to read can be found at http://www.dcs.gla.ac.uk/~jsinger/ssa.html
[CooperT12] K.D. Cooper, L. Torczon; Engineering a Compiler, 2nd Edition
[SSABook] Various authors; Inria: Static Single Assignment Book, Draft from 30.05.2018; http://ssabook.gforge.inria.fr/latest/book.pdf
[LeungG99] A. Leung, L. George; Static Single Assignment Form for machine Code; https://doi.org/10.1145/301618.301667
[Dinechin14] Benoı̂t Dupont de Dinechin; Using the SSA-Form in a Code Generator; https://doi.org/10.1007/978-3-642-54807-9_1
[BraunBHLMZ13] Braun et al.; Simple and Efficient Construction of Static Single Assignment Form; https://doi.org/10.1007/978-3-642-37051-9_6