Implementing Passive Live Range Splitting in fregs-graph
This ticket's purpose is to document the progress and lessons learned of implementing Cooper, Simpson "Passive Live Range Splitting"[1] in the graph coloring register allocator of NCG.
Idea and initial discussion of the topic comes from #16243 by @AndreasK
General Idea
GHC produces sometimes rather long live ranges, for which no register can be found to hold them in for their entire lifetime. If there are regions not using those LRs, it would be beneficial to break them up into smaller LRs, instead of spilling them everywhere. Which LRs to choose and where to break them up is the difficulty.
"Passive Live Range Splitting" is attractive, because it can be simply integrated into an existing chaitin-briggs style allocator. When a live range l is chosen for spilling, the algorithm tries to find one or more LRs to split around l, in order to find a color for l. This means, this approach only incurs a cost if a live range would have to be spilled anyway (except for the up-front cost of building the containment graph).
Implementation
The containment graph and its generation can be integrated with the conflict graph.
Besides chooseSpill
, which only selects a LR for spilling, we also need the actual spill cost function and a split cost function. chooseSpill
may use any spill heuristic, but spill cost must reflect the number of expected STOREs and RELOADs inserted, to be comparable with split cost.
Where colorGraph
finds a list of LRs to spill, we can apply findColor
to calculate which LRs could get a color by splitting around them instead.
Complications
[1] is deceptively simple and there are some issues:
Renumbering
The algorithm only figures out where to insert STOREs and RELOADs, but that alone doesn't create new LRs. It relies on the "renumber" phase of the chaitin-briggs allocator, which analyses the program's SSA form to find def-use chains and assign new virtual register names to them.
This phase simply doesn't exist in NCG. My approach is to have each RELOAD introduce a new name and a "rename" (r -> r'). Then, a kind of reaching definition dataflow analysis has to be performed to find out which renames reach a given block. Multiple, conflicting, incoming names must be handled with copies inserted in the origin blocks to a new name r''. Subsequent instructions using r must be patched.
How costly in terms of compilation time this is and whether it pays off, remains to be seen.
Critical Edges
There is a follow up paper "Improved Passive Splitting" [2]. Besides a new approach to keep splits out of loops, there are some amendments to the original paper.
The original algorithm has the prerequisite, that no unsplit critical edges exist in the CFG. It seems like GHC doesn't generate critical edges and I have added a check to -dasm-lint
.
EDIT: this is very wrong and based on a buggy check. There are critical edges.
Call-clobbered Registers
Also from [2], section 5.1, issues of the original algorithm with callsites. Doesn't seem to bite us, but I'll keep an eye on it.
2-Address Instructions
From [2], section 5.2, issues with instructions reading AND modifying their operands. I thought this would not affect us, bc. the Live-Info distinguishes between "reg born" and subsequent writes. But then I found this:
# Split %vI_s5yI around %vI_s5I0
RELOAD SLOT(3),%vI_n6lj
movb $0,(%vI_n6lj,%vI_s5I0,1)
SPILL %vI_s5yI,SLOT(2) # SplitStore
incq %vI_s5I0
RELOAD SLOT(2),%vI_n6lk # SplitLoad
Besides %vI_s5yI
being spilled in a previous basic block, that code is wrong bc. a splitter and splittee LR should never be live at the same time - the idea is for them to share the same real reg.
The paper does not explain how they resolved that issue and I could not find anything on that subject.
I've tried simply not to store when a reg is "modified" (i.e., read and written in the same instruction). But that doesn't seem like a proper solution. It could lead to an infinite loop, bc. the algorithm decides to split to resolve a conflict, but that conflict won't be resolved. I only discovered this yesterday and need to thing about this some more.
General Optimizations
To mitigate the performance impact, I thought about an optimization that would also benefit the original spill-only code. Maybe the spill and coalescing stages could update the liveness info and conflict graph on-the-fly, removing the need for a complete renewal and rebuilding of the graph.
It's quite straight forward for simple spills (LR is only live during spill instructions, conflicts with anything that is live across the instruction). For coalescing: would simply joining the conflict sets of both LRs be sufficient, or am I missing something?
Splits and renames might complicate that, but I guess it could be done.
[1] Cooper, Simpson; 1998; "Live range splitting in a graph coloring register allocator"; DOI: https://doi.org/10.1002/(SICI)1097-024X(199706)27:6%3C701::AID-SPE104%3E3.0.CO;2-0
[2] Cooper, Eckhardt; 2005; "Improved Passive Splitting"; In Proceedings of PLC 2005;