Transform switch loop into indirect jump threading

Motivation

@chessai mentioned that performing a transformation on could be useful.

I think the suggestion corresponds to indirect threaded code in this writeup from https://www.complang.tuwien.ac.at/forth/threaded-code.html

Indirect Threaded Code description

Indirect Threaded Code

Let's consider constants: They can be represented by the virtual machine instruction lit (for literal), followed by the value of the constant. If the constant occurs frequently, it is more space-efficient to have a virtual machine instruction for this constant. If we have several constants, the code for their virtual machine instructions will look very similar. So, we may want to share the code, while having separate data.
We can do this by adding a level of indirection in NEXT (resulting in indirect threaded code. Each word (a generalization of the virtual machine instruction) has a code field and a parameter field. E.g., in the case of the constant:

code-field:      docon #code address of shared CONstant code
parameter-field: value #value of constant
The virtual machine code is now represented by a sequence of code field addresses, not code addresses. Simple virtual machine instructions (primitives) are typically represented like this:
code-field2:      parameter-field2
parameter-field2: code #machine code of the primitive
The NEXT of indirect threaded code looks like this in MIPS assembly language:
lw   $2,0($4) #get the code-field address of the next word, $4=inst.ptr.
#nop          #load delay slot
lw   $3,0($2) #get the code address of the word
addu $4,$4,4  #advance instruction pointer
j    $3       #execute code for the word
#nop          #branch delay slot
The code for the next instruction can compute the parameter-field address from the code-field address in $770

Here is the original mail

original mail from ghc-devs

I thought Andreas Klebinger in particular might find this interesting.

The talk is here: https://youtu.be/IAdLwUXRUvg?t=1867

(timestamp is 31:07)

basically, if you have an interpreter consisting of a switch statement inside an infinite loop, there's a mechanical transformation using GOTOs that causes a 15% speedup in the switch statement blob, there's n + 1 basic blocks, each of which has a jmp back to the first basic block with probability 1, and the first basic block has a jmp that will go to one of the other n basic blocks with probability 1 / n whereas in the goto statement setup, there's n basic blocks, each of which has a jmp that will go to one of the other n - 1 basic blocks with probability 1 / (n - 1) so if your sequence of instructions is produced by a markov decision process, then the branch predictor will efficiently predict in the latter situation but not necessarily the former in the goto setup the callgraph looks like a complete graph on n vertices whereas in the switch setup it looks like a bipartite graph on (1, n) vertices (where the edges represent both "is called by" and "calls") and because of all those extra edges (n^2 rather than 2n) there's more spots for the branch predictor to build up data

At 39:13 he explains that you get per-address jmp predictions. If you only share a single relative jmp, the cpu only knows how common every opcode is. But, if you have a jmp from each opcode, to the next, the cpu can determine how common an opcode A is followed an opcode B.

He references godbolt (https://godbolt.org/), which seems like a useful tool. and mentions that ruby, jvm (on android only?), and some arm compilers implement this.

There are some tradeoffs - code size does increase. According to a GCC developer in the audience, GCC is not currently able to do this. I have no idea about Clang.

Thanks

Proposal

The main idea is to perform the following transformation:

loop:
switch(e) {
    case 1:
        stmt;
        goto loop;
    case 2:
        stmt;
        goto loop;
    ...
    case last:
        goto end;
}

into

goto labels[e];

label1:
    stmt;
    goto labels[e]
label2:
    stmt;
    goto labels[e]
label_last:
    goto end
...

If this is indeed a reasonable common pattern to appear in the intermediate representation
produced by GHC it shouldn't be too hard to do the transformation.

I'm however not sure if GHC does translate high level recursive functions to this Cmm pattern to begin with.  
Something that needs to be verified first before it's worth thinking about this further.

Edited Oct 15, 2019 by Andreas Klebinger

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

Transform switch loop into indirect jump threading

Motivation

Proposal