Commit df108ac9 authored by sewardj

[project @ 2002-02-07 15:31:20 by sewardj]

Add many details about bytecode generation, the interpreter, and
compiled/interpreted code interop.
parent a9ba9584
......@@ -20,7 +20,7 @@
has proven to be extremely difficult. In retrospect it may be
argued a design flaw that GHC's implementation of the STG
execution mechanism provides only the weakest of support for
automated internal consistency checks. This renders it hard to
automated internal consistency checks. This makes it hard to
debug.
<p>
Execution failures in the interactive system can be due to
......@@ -123,12 +123,253 @@
</ul>
<h2>Entering and returning between interpreted and compiled code</h2>
<h2>Useful stuff to know about the interpreter</h2>
The code generation scheme is straightforward (naive, in fact).
<code>-ddump-bcos</code> prints each BCO along with the Core it
was generated from, which is very handy.
<ul>
<li>Simple lets are compiled in-line. For the general case, let
v = E in ..., E is compiled into a new BCO which takes as
args its free variables, and v is bound to AP(the new BCO,
free vars of E).
<p>
<li><code>case</code>s, as usual, become: push the return
continuation, enter the scrutinee. There is some magic to
make all combinations of compiled/interpreted calls and
returns work, described below. In the interpreted case, all
case alts are compiled into a single big return BCO, which
commences with instructions implementing a switch tree.
</ul>
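<p>
As a rough illustration of the general <code>let</code> case, here is a
toy Python model, not GHC code. The names <code>BCO</code>,
<code>AP</code> and <code>compile_let</code> are hypothetical, and a
string stands in for the compiled bytecode of E.

```python
# Toy model only: BCO, AP and compile_let are hypothetical names.
from dataclasses import dataclass

@dataclass
class BCO:
    params: tuple   # the free variables of E, taken as arguments
    body: str       # stands in for the bytecode compiled from E

@dataclass
class AP:
    fun: BCO
    args: tuple     # values of E's free variables at binding time

def compile_let(free_vars, expr, env):
    """Model `let v = expr in ...`: return the AP that v is bound to."""
    bco = BCO(params=tuple(free_vars), body=expr)
    return AP(fun=bco, args=tuple(env[name] for name in free_vars))

env = {"x": 1, "y": 2}
v = compile_let(["x", "y"], "x + y", env)
```

Here <code>v</code> is an application node pairing the new BCO with the
current values of <code>x</code> and <code>y</code>; nothing is evaluated
until the node is entered.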
<p>
<b>ARGCHECK magic</b>
<p>
You may find ARGCHECK instructions at the start of BCOs which
don't appear to need them; case continuations in particular.
These play an important role: they force objects which should
evaluate to BCOs to actually be BCOs.
<p>
Typically, there may be an application node somewhere in the heap.
This is a thunk which when leant on turns into a BCO for a return
continuation. The thunk may get entered with an update frame on
top of the stack. This is legitimate since from one viewpoint
this is an AP which simply reduces to a data object, so does not
have functional type. However, once the AP turns itself into a
BCO (so to speak) we cannot simply enter the BCO, because that
expects to see args on top of the stack, not an update frame.
Therefore any BCO which expects something on the stack above an
update frame, even a non-function BCO, starts with an ARGCHECK. In
this case it fails, the update is done, the update frame is
removed, and the BCO re-entered. Subsequent entries of the BCO of
course go unhindered.
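<p>
The fail, update, re-enter sequence can be sketched with a toy Python
model. This is not RTS code: the stack is a list whose last element is
the top of the stack, a dict stands in for the thunk, and
<code>UpdateFrame</code>, <code>words_above_frame</code> and
<code>enter_bco</code> are hypothetical names. Only the case described
above (update frame directly on top) is covered.

```python
# Toy model only, not RTS code. The last list element is the stack top.
class UpdateFrame:
    def __init__(self, updatee):
        self.updatee = updatee          # dict standing in for the thunk

def words_above_frame(stack):
    """Count the words between the stack top and the nearest update frame."""
    count = 0
    for word in reversed(stack):
        if isinstance(word, UpdateFrame):
            break
        count += 1
    return count

def enter_bco(bco, stack, n_words, trace):
    """Enter a BCO that begins with ARGCHECK n_words."""
    while True:
        trace.append("ARGCHECK")
        if words_above_frame(stack) >= n_words:
            return                      # check passed: run the BCO body
        if not stack or not isinstance(stack[-1], UpdateFrame):
            raise RuntimeError("case not covered by this sketch")
        frame = stack.pop()             # check failed: remove the frame,
        frame.updatee["value"] = bco    # do the update, then re-enter

thunk = {}
stack = ["returned value", UpdateFrame(thunk)]
trace = []
enter_bco("return BCO", stack, 1, trace)
```

The trace records ARGCHECK twice, matching the double execution visible
in <code>+RTS -D2 -RTS</code> output.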
<p>
The optimised interpreter (<code>#undef REFERENCE_INTERPRETER</code>) handles
this case specially, so that a trip through the scheduler is
avoided. When reading traces from <code>+RTS -D2 -RTS</code>, you
may see BCOs which appear to execute their initial ARGCHECK insn
twice. The first time it fails; the interpreter does the update
immediately and re-enters with no further comment.
<p>
This is all a bit ugly, and, as SimonM correctly points out, it
would have been cleaner to make BCOs unpointed (unthunkable)
objects, so that a pointer to something <code>:: BCO#</code>
really points directly at a BCO.
<p>
<b>Stack management</b>
<p>
There isn't any attempt to stub the stack, minimise its growth, or
generally remove unused pointers ahead of time. This is really
due to laziness on my part, although it does have the minor
advantage that doing something cleverer would almost certainly
increase the number of bytecodes that would have to be executed.
Of course we SLIDE out redundant stuff, to get the stack back to
the sequel depth, before returning a HNF, but that's all. As
usual this is probably a cause of major space leaks.
<p>
<b>Building constructors</b>
<p>
Constructors are built on the stack and then dumped into the heap
with a single PACK instruction, which simply copies the top N
words of the stack verbatim into the heap, adds an info table, and zaps N
words from the stack. The constructor args are pushed onto the
stack one at a time. One upshot of this is that unboxed values
get pushed untaggedly onto the stack (via PUSH_UBX), because that's how they
will be in the heap. That in turn means that the stack is not
always walkable at arbitrary points in BCO execution, although
naturally it is whenever GC might occur.
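<p>
A minimal sketch of what PACK does, as a toy Python model rather than
the interpreter's code. The heap is a list of tuples, the last list
element is the top of the stack, and the field order (top of stack
first) is an assumption made for the illustration. The function
<code>pack</code> and the info table name <code>Pair_info</code> are
hypothetical.

```python
# Toy model only. The last list element is the top of the stack.
def pack(stack, heap, n, info_table):
    """PACK n: move the top n stack words into a new heap object."""
    start = len(stack) - n
    fields = stack[start:][::-1]    # assumed order: top of stack first
    del stack[start:]               # zap n words from the stack
    heap.append((info_table, *fields))
    return len(heap) - 1            # "address" of the new constructor

heap = []
stack = ["continuation", 3, "ptr"]  # 3 is an untagged unboxed value
addr = pack(stack, heap, 2, "Pair_info")
```

Between the push of the untagged <code>3</code> and the PACK, nothing on
the stack says that the word is not a pointer, which is why the stack is
not walkable at that moment.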
<p>
Function closures created by the interpreter use the AP-node
(tagged) format, so although their fields are similarly
constructed on the stack, there is never a stack walkability
problem.
<p>
<b>Unpacking constructors</b>
<p>
At the start of a case continuation, the returned constructor is
unpacked onto the stack, which means that unboxed fields have to
be tagged. Rather than burdening all such continuations with a
complex, general mechanism, I split it into two. The
allegedly-common all-pointers case uses a single UNPACK insn
to fish out all fields with no further ado. The slow case uses a
sequence of more complex UPK_TAG insns, one for each field (I
think). This seemed like a good compromise to me.
<p>
<b>Perspective</b>
<p>
I designed the bytecode mechanism with the experience of both STG
Hugs and Classic Hugs in mind. The latter has a small
set of bytecodes, a small interpreter loop, and runs amazingly
fast considering the cruddy code it has to interpret. The former
had a large interpretative loop with many different opcodes,
including multiple minor variants of the same thing, which
made it difficult to optimise and maintain, yet it performed more
or less comparably with Classic Hugs.
<p>
My design aims were therefore to minimise the interpreter's
complexity whilst maximising performance. This means reducing the
number of opcodes implemented, whilst reducing the number of insns
despatched. In particular there are only two opcodes, PUSH_UBX
and UPK_TAG, which deal with tags. STG Hugs had dozens of opcodes
for dealing with tagged data. In cases where the common
all-pointers case is significantly simpler (UNPACK) I deal with it
specially. Finally, the number of insns executed is reduced a
little by merging multiple pushes, giving PUSH_LL and PUSH_LLL.
These opcode pairings were determined by using the opcode-pair
frequency profiling stuff which is ifdef-d out in
<code>Interpreter.c</code>. These significantly improve
performance without having much effect on the ugliness or
complexity of the interpreter.
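<p>
The idea behind opcode-pair profiling can be shown in a few lines of
Python. This is only an illustration of the technique, not the ifdef-d
code in <code>Interpreter.c</code>; the trace and the opcode name
<code>PUSH_L</code> are made up for the example.

```python
from collections import Counter

def pair_frequencies(trace):
    """Count how often each adjacent pair of opcodes is executed."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical execution trace, for illustration only.
trace = ["PUSH_L", "PUSH_L", "PACK", "PUSH_L", "PUSH_L", "PUSH_L", "ENTER"]
freq = pair_frequencies(trace)
best_pair, best_count = freq.most_common(1)[0]
```

A pair that dominates such a count is a candidate for a merged opcode,
which is how a two-push instruction like PUSH_LL would be motivated.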
<p>
Overall, the interpreter design is something which turned out
well, and I was pleased with it. Unfortunately I cannot say the
same of the bytecode generator.
<h2><code>case</code> returns between interpreted and compiled code</h2>
Variants of the following scheme have been drifting around in GHC
RTS documentation for several years. Since what follows is
actually what is implemented, I guess it supersedes all other
documentation. Beware; the following may make your brain melt.
In all the pictures below, the stack grows downwards.
<p>
<b>Returning to interpreted code</b>.
<p>
Interpreted returns employ a set of polymorphic return infotables.
Each element in the set corresponds to one of the possible return
registers (R1, D1, F1) that compiled code will place the returned
value in. In fact this is a bit misleading, since R1 can be used
to return either a pointer or an int, and we need to distinguish
these cases. So, supposing the set of return registers is {R1p,
R1n, D1, F1}, there would be four corresponding infotables,
<code>stg_ctoi_ret_R1p_info</code>, etc. In the pictures below we
call them <code>stg_ctoi_ret_REP_info</code>.
<p>
These return itbls are polymorphic, meaning that all 8 vectored
return codes and the direct return code are identical.
<p>
Before the scrutinee is entered, the stack is arranged like this:
<pre>
   |        |
   +--------+
   |  BCO   | -------> the return continuation BCO
   +--------+
   | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
   +--------+
                          BCO* bco = Sp[1];
                          push R1/F1/D1 depending on REP
                          push bco
                          yield to sched
</pre>
On entry, the interpreted continuation BCO expects the stack to look
like this:
<pre>
   |        |
   +--------+
   |  BCO   | -------> the return continuation BCO
   +--------+
   | itbl * | -------> stg_ctoi_ret_REP_info, as above
   +--------+
   : VALUE  :  (the returned value, shown with : since it may occupy
   +--------+   multiple stack words)
</pre>
A machine code return will park the returned value in R1/F1/D1,
and enter the itbl on the top of the stack. Since it's our magic
itbl, this pushes the returned value onto the stack, which is
where the interpreter expects to find it. It then pushes the BCO
(again) and yields. The scheduler removes the BCO from the top,
and enters it, so that the continuation is interpreted with the
stack as shown above.
<p>
An interpreted return will create the value to return at the top
of the stack. It then examines the return itbl, which must be
immediately underneath the return value, to see if it is one of
the magic <code>stg_ctoi_ret_REP_info</code> set. Since this is so,
it knows it is returning to an interpreted continuation. It
therefore simply enters the BCO which it assumes is immediately
underneath the itbl on the stack.
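<p>
Both routes into an interpreted continuation can be checked against each
other with a toy Python model. It is a sketch, not RTS code: the last
list element is the top of the stack, strings stand in for itbls and
BCOs, the returned value is assumed to occupy one word, and the function
names are hypothetical.

```python
# Toy model only. The last list element is the top of the stack.
CTOI = "stg_ctoi_ret_R1p_info"      # one member of the magic set

def compiled_return(stack, r1):
    """Machine code return: value in R1, then enter the itbl on top."""
    assert stack[-1] == CTOI
    bco = stack[-2]                 # BCO* bco = Sp[1]
    stack.append(r1)                # push R1
    stack.append(bco)               # push bco, then yield to the scheduler
    return stack.pop()              # scheduler removes the BCO and enters it

def interpreted_return(stack, value):
    """Interpreted return: push the value, then inspect the itbl below it."""
    stack.append(value)
    if stack[-2] == CTOI:
        return stack[-3]            # enter the BCO underneath the itbl
    raise NotImplementedError("return to compiled code")

stack_a = ["k_bco", CTOI]
entered_a = compiled_return(stack_a, 42)
stack_b = ["k_bco", CTOI]
entered_b = interpreted_return(stack_b, 42)
```

Either way the same BCO is entered with the same stack: BCO, itbl, then
the value on top.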
<p>
<b>Returning to compiled code</b>.
<p>
Before the scrutinee is entered, the stack is arranged like this:
<pre>
                                ptr to vec code 8 ------> return vector code 8
   |        |                   ....
   +--------+                   ptr to vec code 1 ------> return vector code 1
   | itbl * | --                Itbl end
   +--------+   \               ....
                 \              Itbl start
                  ------------> direct return code
</pre>
The scrutinee value is then entered.
The case continuation(s) expect the stack to look the same, with
the returned HNF in a suitable return register, R1, D1, F1 etc.
<p>
A machine code return knows whether it is doing a vectored or
direct return, and, if the former, which vector element it is.
So, for a direct return we jump to <code>Sp[0]</code>, and for a
vectored return, jump to <code>((CodePtr*)(Sp[0]))[ - ITBL_LENGTH
- vector number ]</code>. This is (of course) the scheme that
compiled code has been using all along.
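<p>
The index arithmetic for a vectored return can be checked with a small
Python model of the layout in the picture. Memory is a list, labels
stand in for code pointers and code, and the value chosen for
<code>ITBL_LENGTH</code> is an assumption made only for this
illustration.

```python
ITBL_LENGTH = 3   # assumed info table size in words, for illustration

def build_return_point():
    """Lay out the vector pointers, then the itbl, then the direct code."""
    memory = [f"vec code {n}" for n in range(8, 0, -1)]   # vec 8 ... vec 1
    memory += ["itbl word"] * ITBL_LENGTH
    memory.append("direct return code")
    return memory, len(memory) - 1      # Sp[0] holds this index

def return_target(memory, sp0, vector_number=None):
    """Pick the return target for a direct or a vectored return."""
    if vector_number is None:
        return memory[sp0]                               # direct return
    return memory[sp0 - ITBL_LENGTH - vector_number]     # vectored return

memory, sp0 = build_return_point()
```

With this layout, vector number 1 sits immediately before the itbl and
vector number 8 is furthest from the direct return code.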
<p>
An interpreted return will, as described just above, have examined
the itbl immediately beneath the return value it has just pushed,
and found it not to be one of the <code>stg_ctoi_ret_REP_info</code> set,
so it knows this must be a return to machine code. It needs to
pop the return value, currently on the stack, into R1/F1/D1, and
jump through the info table. Unfortunately the first part cannot
be accomplished directly since we are not in Haskellised-C world.
<p>
We therefore employ a second family of magic infotables, indexed,
like the first, on the return representation, and therefore with
names of the form <code>stg_itoc_ret_REP_info</code>. This is
pushed onto the stack (note, tagged values have their tag zapped),
giving:
<pre>
   |        |
   +--------+
   | itbl * | -------> arbitrary machine code return itbl
   +--------+
   : VALUE  :  (the returned value, possibly multiple words)
   +--------+
   | itbl * | -------> stg_itoc_ret_REP_info, with code:
   +--------+
                          pop myself (stg_itoc_ret_REP_info) off the stack
                          pop return value into R1/D1/F1
                          do standard machine code return to itbl at t.o.s.
</pre>
We then return to the scheduler, asking it to enter the itbl at
t.o.s. When entered, <code>stg_itoc_ret_REP_info</code> removes
itself from the stack, pops the return value into the relevant
return register, and returns to the itbl to which we were trying
to return in the first place.
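<p>
The same hand-off can be traced in a toy Python model. It is a sketch
under simplifying assumptions, not RTS code: the last list element is
the top of the stack, the value occupies one word, a dict stands in for
the register file, and the function name is hypothetical.

```python
# Toy model only. The last list element is the top of the stack.
ITOC = "stg_itoc_ret_R1p_info"      # one member of the second magic family

def interpreted_return_to_compiled(stack, value, regs):
    """Return `value` from interpreted code to a machine code itbl."""
    stack.append(value)             # the interpreter creates the value
    stack.append(ITOC)              # ... and pushes the magic itbl
    # The scheduler now enters the itbl at the top of the stack:
    popped = stack.pop()            # pop myself off the stack
    assert popped == ITOC
    regs["R1"] = stack.pop()        # pop the return value into R1
    return stack[-1]                # standard return to the itbl at t.o.s.

regs = {}
stack = ["compiled_ret_info"]
target = interpreted_return_to_compiled(stack, 42, regs)
```

Afterwards the stack is exactly as compiled code left it, with the value
in a register, so the compiled continuation cannot tell that the
scrutinee was interpreted.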
<p>
Amazingly enough, this stuff all actually works!
<p><small>
<p><small>
<!-- hhmts start -->
Last modified: Fri Feb 1 16:14:11 GMT 2002
Last modified: Thursday February 7 15:33:49 GMT 2002
<!-- hhmts end -->
</small>
</body>
......