Implementing Passive Live Range Splitting in fregs-graph

added NCG backend code generation labels

I'm not sure if that's at all related to the problems you are trying to solve. But when we wrote the backend for an SSA-based compiler in a compiler lab, we adopted the convention that every instruction basically has two ... "labels": one where inputs are read (e.g. uses of a live range) and one where outputs are written (defs of a new live range). This way, a write never coincides with a read. Then live range splitting is simply saying "value §42 is alive in %rax for [0,13], then dead for [14,27], then alive in %rdx for [28,33]", and as soon as there is one dead interval (or a register permutation that you can't break without a stack slot) you allocate a spill slot.
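
For illustration, the two-labels-per-instruction idea could be sketched roughly like this. It is a minimal Python sketch, not code from that compiler lab: `read_pos`, `write_pos`, `Segment`, `gaps` and `needs_spill_slot` are hypothetical names, and the "register permutation" case mentioned above is not modelled at all.

```python
from dataclasses import dataclass


def read_pos(i: int) -> int:
    """Position at which instruction i reads its inputs (assumed numbering)."""
    return 2 * i


def write_pos(i: int) -> int:
    """Position at which instruction i writes its outputs (assumed numbering)."""
    return 2 * i + 1


@dataclass(frozen=True)
class Segment:
    """A value is alive in `reg` for the inclusive position range [start, end]."""
    reg: str
    start: int
    end: int


def gaps(segments):
    """Return the dead intervals between consecutive live segments of one value."""
    ordered = sorted(segments, key=lambda s: s.start)
    return [
        (a.end + 1, b.start - 1)
        for a, b in zip(ordered, ordered[1:])
        if b.start > a.end + 1
    ]


def needs_spill_slot(segments) -> bool:
    """A value needs a spill slot as soon as it has at least one dead interval."""
    return bool(gaps(segments))


# The example from the comment: alive in %rax for [0,13], then in %rdx for [28,33].
v42 = [Segment("%rax", 0, 13), Segment("%rdx", 28, 33)]
```

With these definitions, `gaps(v42)` yields the single dead interval `(14, 27)`, so the value would get a spill slot. Two segments that touch (e.g. `[0,13]` followed by `[14,20]`) would only amount to a register-to-register move.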

Maybe something like that could inspire a solution?

added Pnormal Tfeature request labels

@sgraf812 thanks for your input. I'm not sure whether this applies though. The algorithm wants to insert a spill for a splittee before the definition of a splitter. Since the program is not in SSA form, GHC Liveness distinguishes between "birth" (first def) and any subsequent write. So I chose to insert spills before each write. Maybe that workaround is enough, maybe I need to work this into building the containment graph, so that these splits can't happen in the first place.

I'm still debugging my dataflow analysis/limited renumbering and am building some visualizations to help me with that. I just wanted to comment on something I found: to have a simple test program, I used the one from #8048.

vreg %vI_nKI is used in the marked nodes of the cfg:

cHO:
  # ...
  movq %vI_nKI,80(%rbp)  # Last use in block
  addq $-64,%rbp
  # ...

cJF:
  # ...
  movq 144(%rbp),%vI_nKI
  # ...

So this is the same vreg, but these are two distinct live ranges. The passive splitting algorithm is not made for something like that. Apart from there being no store/load to insert here, the first LR is already going dead.

This is something a proper renumbering phase should handle, IMHO. sigh

EDIT:

I just figured out that the liveness information doesn't mean what I thought it meant. For some reason I only figured that out now *face palm*

Bc. of this comment in Liveness.hs:

liveBorn :: RegSet -- ^ registers born in this instruction (written to for first time).

I assumed that refers to the virtual register, i.e., globally in the procedure. But that is not the case as one can see in block cJF:

cJF:
	# ...

	movq %vI_sGn,%vI_nKi
	    # born:    %vI_nKi
	     
	addq %vI_nKg,%vI_nKi
	    # r_dying: %vI_nKg
	     
	movq %vI_sGm,%vI_nKk
	    # born:    %vI_nKk
	     
	addq %vI_nKi,%vI_nKk
	    # r_dying: %vI_nKi

	# ...

	movq %vI_sGw,%vI_nKI
	    # born:    %vI_nKI
	    # r_dying: %vI_sGw
	     
	addq %vI_nKG,%vI_nKI
	    # r_dying: %vI_nKG
	    
	movq %vI_nKI,-8(%r12)
	    # r_dying: %vI_nKI
	    
	# ...

So within the same block %vI_nKI is dying/born twice. Well, that violates my assumption - I guess it's back to the drawing board...

I'm wondering what would happen if we simply renamed all these distinct live ranges. Doing this for births+deaths within a basic block should be easy. That could make coloring easier, since those disjoint LRs won't have to receive the same hardreg. But it would increase the size of the conflict graph quite a bit.
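
To make the "rename births+deaths within a basic block" idea concrete, here is a minimal Python sketch. It is not based on GHC's actual data structures: `Instr`, `local_ranges` and `rename_local_ranges` are hypothetical, the `born`/`r_dying` sets only mimic the annotations in the listing above, and the instruction operands in the example (including `%vI_sGx`) are made up. Only ranges that are both born and die by a read inside the block are renamed. Live-in and live-out ranges keep their original name.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    op: str
    reads: list
    writes: list
    born: set = field(default_factory=set)     # vregs born at this instruction
    r_dying: set = field(default_factory=set)  # vregs whose last read is here


def local_ranges(block):
    """Return (vreg, birth_index, death_index) for ranges local to the block."""
    open_birth = {}
    closed = []
    for i, ins in enumerate(block):
        # A read happens before the write, so handle deaths before births.
        for r in ins.r_dying:
            if r in open_birth:  # ignore live-in vregs that die here
                closed.append((r, open_birth.pop(r), i))
        for r in ins.born:
            open_birth[r] = i
    # Ranges still open at the end are live-out and are left alone.
    return sorted(closed, key=lambda t: (t[1], t[0]))


def rename_local_ranges(block):
    """Give each block-local live range its own name, e.g. '%v.0', '%v.1'."""
    counters, read_name, write_name = {}, {}, {}
    for vreg, birth, death in local_ranges(block):
        n = counters.get(vreg, 0)
        counters[vreg] = n + 1
        new = f"{vreg}.{n}"
        for i in range(birth, death + 1):
            if i > birth:   # reads after the birth belong to this range
                read_name[(i, vreg)] = new
            if i < death:   # writes before the final read belong to this range
                write_name[(i, vreg)] = new
    return [
        (ins.op,
         [read_name.get((i, r), r) for r in ins.reads],
         [write_name.get((i, r), r) for r in ins.writes])
        for i, ins in enumerate(block)
    ]


block = [
    Instr("movq", ["%vI_sGw"], ["%vI_nKI"], born={"%vI_nKI"}, r_dying={"%vI_sGw"}),
    Instr("movq", ["%vI_nKI"], [], r_dying={"%vI_nKI"}),
    Instr("movq", ["%vI_sGx"], ["%vI_nKI"], born={"%vI_nKI"}),
    Instr("addq", ["%vI_nKG", "%vI_nKI"], ["%vI_nKI"]),
    Instr("movq", ["%vI_nKI"], [], r_dying={"%vI_nKI"}),
]
renamed = rename_local_ranges(block)
```

In this example the two disjoint ranges of `%vI_nKI` become `%vI_nKI.0` and `%vI_nKI.1`, so they are no longer forced into the same hardreg. The price is one extra node per renamed range in the conflict graph.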

Anyway, I'm more confused than before...

Just FYI, my recent realizations on this issue (also documenting this for my scatterbrain):

I think some of my confusion came from (ambiguous or misunderstood?) terminology, e.g., there is live range splitting and "live range splitting". I thought of a LR as a web of use-def and def-use chains, i.e., something that has to be assigned to a single hard reg. Splitting here means something like in the "Passive LR splitting" paper, where we need to insert stores/loads to break them up. Afterwards "renumbering" or however you may want to call it, would have to introduce new vreg names (and copies).

Here we have disjoint webs bearing the same vreg name, so they are actually - unnecessarily - constrained to receive the same hreg.

I've done some reading over the weekend and the gist is: "computing all the def-use and use-def chains via dataflow analysis and representing them takes (m * n) space and is so complicated, but we won't explain this in detail bc. SSA transformation takes care of all that and more, so here is a chapter on that."

Looking at what's there and cost/benefit (both in compile and implementation time), I've decided to implement a "sort of" pruned-SSA form transformation and destruction. "sort of" bc. I won't introduce a new IR and x86's 2-Address Instructions aren't exactly SSA semantics. This is fine though, bc. this is primarily for register allocation.

I'll be heavily relying on Andreas' CFG work, so this will only work on supported platforms (x86_64).

mentioned in issue #19453

closed

@cptwunderlich I saw that you closed this ticket. Is this because you've come to think it's not worthwhile to implement anymore? Because you won't work on it anymore? Or something else?

Mostly, because I have abandoned this effort to write a graph coloring register allocator on SSA form.

I'm also not sure if this is the right approach for -fregs-graph. There are some additional complications, e.g., critical edges need to be split first. And it will need renumbering after splitting, so one would have to repeatedly run my SSA construction and destruction (actually, now that I have implemented the Braun et al. algorithm it's simpler, bc. that can be used to repair SSA.) Furthermore, I did some reading and e.g. [1] claims that:
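
As a reminder of what "critical edges need to be split first" involves: an edge is critical if its source has more than one successor and its target has more than one predecessor. The following is a minimal Python sketch on a plain successor map. It has nothing to do with GHC's actual CFG representation, and `critical_edges`, `split_critical_edges` and the generated block labels are hypothetical.

```python
def critical_edges(cfg):
    """cfg maps each block label to the list of its successor labels."""
    preds = {}
    for src, succs in cfg.items():
        for dst in succs:
            preds.setdefault(dst, []).append(src)
    return [
        (src, dst)
        for src, succs in cfg.items()
        for dst in succs
        if len(succs) > 1 and len(preds[dst]) > 1
    ]


def split_critical_edges(cfg):
    """Return a new CFG with an empty block inserted on every critical edge."""
    new_cfg = {block: list(succs) for block, succs in cfg.items()}
    for src, dst in critical_edges(cfg):
        mid = f"{src}_{dst}"  # assumed not to clash with an existing label
        new_cfg[src] = [mid if s == dst else s for s in new_cfg[src]]
        new_cfg[mid] = [dst]
    return new_cfg


# A branches to B and C, and B falls through to C: the edge A -> C is critical.
cfg = {"A": ["B", "C"], "B": ["C"], "C": []}
split = split_critical_edges(cfg)
```

After splitting, the edge `A -> C` goes through the new block `A_C`, which gives spill code or copies for that edge a place to live without affecting the path through `B`.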

The spilling approach of Chaitin-style allocators is very crude since the spilling decisions are only taken based on the coloring heuristic becoming stuck. [...] In practice however, one often uses a spilling phase before starting the allocation (see Paleczny et al. [2001] or Morgan [1998]).

So I think that there are lots of ways to improve -fregs-graph and that this is probably not the most promising one.

Since I already have implemented SSA transformation, I want to see if the SSA based GCRA of Hack et al. works better. It allows for separate spilling, coloring and coalescing phases and thus also easier experimentation with heuristics. I also think it will be easier to spill partial LRs based on location (loop bodies). Oh and it does a "natural" kind of live range splitting, simply bc. it is in SSA form. There are also tons of possible improvements I have in mind (rematerialization, coloring for spill slots), but first I need to get a minimum viable prototype going.

EDIT: Ah yes, I almost forgot. I added the section "complications" to the original ticket 4 months ago. The follow-up paper, of which I was not aware when starting out, actually raises a lot of issues not addressed in the original paper. Especially the whole part about "oh, our algorithm doesn't work correctly on 2-Address Instructions, we have implemented a fix, but we don't have room here to present it" is quite the deal breaker for x86...

mentioned in issue #21453 (closed)

General Idea

Implementation

Complications

Renumbering

Critical Edges

Call-clobbered Registers

2-Address Instructions

General Optimizations
