Replace (internal) bytecode interpreter with copy-and-patch JIT compilation
Motivation
GHCi interprets code fast enough for most uses, but it could always be faster. In particular, Template Haskell has a reputation for being slow. I have not yet measured whether this is due to bytecode interpretation, but if that is indeed the case, we should try to improve it.
Proposal
Instead of generating bytecode in `GHC.StgToByteCode` and interpreting it in `rts/interpreter.c`, do copy-and-patch compilation in `GHC.StgToByteCode`. That is,
- Generate up front, at GHC bootstrap time, a library of machine-code stencils that covers the semantics of every bytecode operation. These stencils have bespoke holes in them to be filled with literal constants or immediates (e.g., register numbers or stack locations) for the corresponding variable slots.
- To generate code for a bytecode instruction, query the stencil library for the appropriate stencil (there are off-the-shelf solutions for that), copy/mmap it into the interpreting process's executable memory, and then patch in the appropriate constants/immediates at the bespoke hole locations.
If done right, the resulting JIT compiler outperforms (on the Pareto frontier of compilation latency and execution speed) bytecode interpreters and LLVM `-O0` compilation alike.
Very exciting stuff, but perhaps not for the faint of heart; it might also pull in a big system dependency on the MetaVar framework to generate, copy, and patch those stencils. This issue just documents the possibility so that future discussions can refer to it.
I think the approach is quite closely related to compilation via a bespoke library of supercombinators (and thus to https://github.com/augustss/MicroHs); the difference is that we keep around compiled versions of those supercombinators and patch them up cheaply at runtime. Doing so is faster than calling out to a C compiler to specialise (the source code of) those supercombinators for us.