SSA-based tree-scan register allocator for NCG
Motivation
I was looking into ways to improve -fregs-graph
(see #7679 and #16243) and register allocation in general.
In theory, linear scan is supposed to be "fast" and graph-coloring can generate better code, as it has a global view on live ranges. In practice, it's a bit more complicated and NCG's linear scan allocator seems to be well tune.
Either way, an algorithm aware of CFG structure, i.e., loops, should do better. I started by trying to implement live range splitting in -fregs-graph (#19274 (closed)) and had to implement SSA transformation (#19453) for renumbering (renaming vregs). That failed, but it made me aware of SSA-based register allocation techniques.
Single Static Assignment (SSA) form IRs have been common in many production compilers for decades. Traditionally, out-of-SSA transformation happened before register allocation. Around 2005, several research groups noticed that interference graphs of SSA-form programs are chordal graphs, which can be colored in polynomial time. I won't go into the details (see the literature), but of course this doesn't mean that register allocation can be done in polytime, as we can't run SSA-form on a CPU. But it does offer new possibilities. Namely that register pressure is equal to clique size and lowering register pressure to the number of available registers assures successful coloring (for a heuristic like optimistic coloring, it may still fail to find a register and introduce an unnecessary spill!).
Proposal
I've been looking at Hack, Braun, etcs. work. The SSA-form induce a sort of live range split on the program (at merging control flow -> Phi nodes) and it allows for a decoupled design, where each phase (spill, color, coalesce) runs separately. Unlike Chaitin-Briggs, where we run the algorithm iteratively.
There are several designs, but I chose one which avoids building the interference graph (IG), as it promises to be faster, while the resulting performance should only be slightly worse. But the modular nature of the design would allow us to implement the "graph coloring with re-coloring" approach separately (and reuse everything else).
The phases are:
SSA construction ("Simple and Efficient Construction of Static Single Assignment Form"; Braun; 2013)
Transform our assembly-like IR into SSA-form. This is #19453 with the Braun algo.
Next-Use-Distance Analysis ("Register Spilling and Live-Range Splitting for SSA-Form Programs"; Braun, Hack; 2009)
This is an "extended" liveness analysis, which calculates next-use-distances, i.e., for a live range l and a program point p, how many instructions are between p and the next use of l. Loops are considered to execute frequently, so any use "after" the loop has a much higher distance. Calculation of maximum loop register pressure is done here as well (i.e., how many values are live at once at most for each loop).
Loop-aware spilling ("Register Spilling and Live-Range Splitting for SSA-Form Programs"; Braun, Hack; 2009)
This applies Belady's-Algorithm to each basic block and connects them with fixup code, where needed. Belady's-Algorithm was originally intended for page caches, where it is used to evict the page with the furthest use to make room for a new one. Similarly, we use the next-use-distance to spill values whose next-use is furthest in the future, as we need to make room for new values. Loop handling is specialized: Try to keep as many values as possible in registers, that are used in the loop. Only keep values that are live-through (but not used inside), if they won't have to be spilled (bc. register pressure would be too high). This will work to place as many spills/reloads outside of the loop as possible.
Register assignment & coalescing ("Preference-Guided Register Assignment"; Braun, Mallon, Hack; 2010)
Compute register preferences for all LRs. Then perform register assignment "trace-wise". Blocks are sorted by their cumulative (statically estimated) execution frequency. We try to find a path from the "bottom" (reverse-RPO) to the start/already assigned block, which maximizes execution frequency. I.e., we try to color more frequently executed paths through the program first. Mismatches between traces have to be rectified with shuffle code.
This somewhat assumes that spills (memory-access) are expensive and moves/swaps are cheap. This is also the only part, that may introduce additional spills after the spilling phase, if there are no free registers and regs can't be swapped (the impact of this is very architecture dependent, e.g., x86 has XCHG, so we can always swap GPRs without needing additional regs).
Outcome & Risks
I hope that my prototype can beat the current allocators, or at least come very close, as the first iteration won't be very optimized. In an ideal scenario, this could become the foundation for register allocation and we could also have an SSA-based linear scan allocator, so we can reduce some of the code duplication.
Of course, the current linear scan allocator is well tuned, and it's not clear whether or when a new allocator will beat it an by how much - I'm aware that it's not worth the hassle if it's faster by 0.5%. Compile times are also always an issue. I have chosen the approach without IG for speed, but there are still lots of steps involved. The first version won't be optimized at all, so speed is not clear. Furthermore, I'm using some expensive bits of NCG, e.g., static exec. freq. estimation and loop analysis. Those are performed later for block layout optimization and it would be better to reuse those results.
I'm also making heavy use of the CFG, which is currently only available for x86_64!
Either way, it's a great learning experience and at worst it might yield some insights on how to improve current register allocation. E.g., SSA-transformation can be used for renumbering, which is needed for any approach which splits LRs (that includes spilling them only partially). Or a Belady-based pre-spill-pass is possible (and according to literature done in some compilers).