|
|
# I know kung fu: learning STG by example
|
|
|
|
|
|
|
|
|
The STG machine is an essential part of GHC, the world's leading
|
|
|
Haskell compiler. It defines how the Haskell evaluation model
|
|
|
should be efficiently implemented on standard hardware. Despite
|
... | ... | @@ -10,8 +9,8 @@ its modern, eval/apply-based, pointer-tagged incarnation by a |
|
|
series of simple examples showing how Haskell source code is
|
|
|
compiled.
|
|
|
|
|
|
## What is STG, exactly?
|
|
|
|
|
|
## What is STG, exactly?
|
|
|
|
|
|
Haskell code being sucked through GHC has a complex lifecycle.
|
|
|
Broadly speaking, it transitions between five representations:
|
... | ... | @@ -32,43 +31,37 @@ Broadly speaking, it transitions between five representations: |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
The path from C-- to assembly varies: the three possible backends
|
|
|
are C (`-fvia-c` (**[OUTDATED](https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/codegens.html#c-code-generator-fvia-c)**)), LLVM (`-fllvm`), and the default backend -- the
|
|
|
native code genarator (or NCG), which generates assembly directly
|
|
|
from the GHC-internal C-- data type.
|
|
|
|
|
|
|
|
|
STG is a simple functional language, rather like the more famous
|
|
|
Core language. It differs in the following main respects:
|
|
|
|
|
|
1. In its current incarnation, it isn't typed in the Haskell sense,
|
|
|
though it does know about *representation* types
|
|
|
though it does know about *representation* types
|
|
|
1. It is in administrative normal form (ANF), which is where every
|
|
|
subexpression is given a name
|
|
|
subexpression is given a name
|
|
|
1. Every $\\lambda$, constructor application, and primitive operator
|
|
|
is $\\eta$-expanded
|
|
|
is $\\eta$-expanded
|
|
|
1. It is annotated with a ton of information that the code
|
|
|
generator is interested in knowing
|
|
|
|
|
|
generator is interested in knowing
|
|
|
|
|
|
STG expressions can be one of the following:
|
|
|
|
|
|
1. Atoms (i.e. literals and variables)
|
|
|
1. `let`-bindings (both recursive and non-recursive) over another
|
|
|
expression, where let-bound things are one of:
|
|
|
|
|
|
- A function value with explicit lambdas
|
|
|
- An unsaturated application
|
|
|
- A constructor applied to atoms
|
|
|
- A thunk (i.e. any expression not fitting into one of the above
|
|
|
categories)
|
|
|
|
|
|
expression, where let-bound things are one of:
|
|
|
- A function value with explicit lambdas
|
|
|
- An unsaturated application
|
|
|
- A constructor applied to atoms
|
|
|
- A thunk (i.e. any expression not fitting into one of the above
|
|
|
categories)
|
|
|
1. Saturated primitive application of a primitive to variables
|
|
|
1. Application of a variable to one or more atoms
|
|
|
1. Case deconstruction of an expression, where each branch may also
|
|
|
be an expression
|
|
|
|
|
|
be an expression
|
|
|
|
|
|
The job of the *STG machine* is to evaluate these expressions in
|
|
|
a way which is efficiently implementable on standard hardware. This
|
... | ... | @@ -76,7 +69,6 @@ document will look at how exactly this is achieved by looking at |
|
|
real examples of the C-- code GHC generates for various Haskell
|
|
|
expressions.
|
|
|
|
|
|
|
|
|
This document will take a very low-level view of the machine, so if
|
|
|
you want to get comfortable with how the STG machine executes at a
|
|
|
more abstract level before reading this document, you might want to
|
... | ... | @@ -89,8 +81,8 @@ system has also been implemented as a Haskell program called |
|
|
[Bernie Pope](https://github.com/bjpop/ministg), for those who wish to
|
|
|
see it in action on some simple examples.
|
|
|
|
|
|
## An overview of the STG machine
|
|
|
|
|
|
## An overview of the STG machine
|
|
|
|
|
|
Before we dive in, a note: this document will describe the STG
|
|
|
machine as it is implemented on x86-style architectures. I will use
|
... | ... | @@ -99,7 +91,6 @@ x86 by GHC" interchangeably. The implementation is somewhat |
|
|
different on x64, not least due to the greater number of available
|
|
|
registers.
|
|
|
|
|
|
|
|
|
This overview section is rather bare. Readers might be able to fill
|
|
|
in any gaps in my explanation by using some of the following
|
|
|
sources:
|
... | ... | @@ -107,58 +98,58 @@ sources: |
|
|
- [The Haskell Execution Model](https://gitlab.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution)
|
|
|
- [Storage](https://gitlab.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage)
|
|
|
- [The Spineless Tagless G-machine](http://research.microsoft.com/en-us/um/people/simonpj/Papers/spineless-tagless-gmachine.ps.gz)
|
|
|
|
|
|
- now sadly rather out of date
|
|
|
- [Faster laziness through dynamic pointer tagging](http://research.microsoft.com/en-us/um/people/simonpj/papers/ptr-tag/ptr-tagging.pdf)
|
|
|
|
|
|
### Components of the machine
|
|
|
|
|
|
### Components of the machine
|
|
|
|
|
|
In its bare essentials, the STG machine consists of three parts:
|
|
|
|
|
|
1. The STG registers:
|
|
|
|
|
|
- There are rather a lot of registers here: more than can be
|
|
|
practicably stored in actual available processor registers on most
|
|
|
architectures.
|
|
|
- To deal with the lack of processor registers, most of the STG
|
|
|
registers are actually kept on the stack in a block of memory
|
|
|
pointed to by a special STG register called the "base register" (or
|
|
|
`BaseReg`). To get or set values of registers which are not kept in
|
|
|
processor registers, the STG machine generates an instruction to
|
|
|
load or store from an address relative to the `BaseReg`.
|
|
|
- The most important four registers are the `BaseReg`, the stack
|
|
|
pointer (`Sp`), the heap pointer (`Hp`), and the general purpose
|
|
|
register `R1` which is used for intermediate values, as well as for
|
|
|
returning evaluated values when unwinding the stack. These are the
|
|
|
four registers which are assigned actual processor registers when
|
|
|
implementing the STG machine on x86.
|
|
|
- There are rather a lot of registers here: more than can be
|
|
|
practicably stored in actual available processor registers on most
|
|
|
architectures.
|
|
|
- To deal with the lack of processor registers, most of the STG
|
|
|
registers are actually kept on the stack in a block of memory
|
|
|
pointed to by a special STG register called the "base register" (or
|
|
|
`BaseReg`). To get or set values of registers which are not kept in
|
|
|
processor registers, the STG machine generates an instruction to
|
|
|
load or store from an address relative to the `BaseReg`.
|
|
|
- The most important four registers are the `BaseReg`, the stack
|
|
|
pointer (`Sp`), the heap pointer (`Hp`), and the general purpose
|
|
|
register `R1` which is used for intermediate values, as well as for
|
|
|
returning evaluated values when unwinding the stack. These are the
|
|
|
four registers which are assigned actual processor registers when
|
|
|
implementing the STG machine on x86.
|
|
|
|
|
|
1. The STG stack:
|
|
|
|
|
|
- Stores function arguments and continuations (i.e. the stack
|
|
|
frames which are executed when a function returns)
|
|
|
- Grows downwards in memory
|
|
|
- The top of the stack is pointed to by the STG register `Sp`, and
|
|
|
the maximum available stack pointer is stored in `SpLim`. There is
|
|
|
no frame pointer.
|
|
|
- Stores function arguments and continuations (i.e. the stack
|
|
|
frames which are executed when a function returns)
|
|
|
- Grows downwards in memory
|
|
|
- The top of the stack is pointed to by the STG register `Sp`, and
|
|
|
the maximum available stack pointer is stored in `SpLim`. There is
|
|
|
no frame pointer.
|
|
|
|
|
|
1. The heap:
|
|
|
|
|
|
- Used to store many different sorts of heap object: notably
|
|
|
functions, thunks and data constructors
|
|
|
- Grows upwards in memory, towards the stack
|
|
|
- All allocation occurs using a bump-allocator: the heap pointer is
|
|
|
simply incremented by the number of bytes desired (subject to to a
|
|
|
check that this does not exhaust available memory). The garbage
|
|
|
collector is responsible for moving objects out of the area of the
|
|
|
heap managed by the bump allocator and into the care of its
|
|
|
generational collector.
|
|
|
- The last address in the bump-allocated part of the heap that has
|
|
|
been used is pointed to by the STG register `Hp`, with `HpLim`
|
|
|
holding the maximum address available for bump-allocation.
|
|
|
- Used to store many different sorts of heap object: notably
|
|
|
functions, thunks and data constructors
|
|
|
- Grows upwards in memory, towards the stack
|
|
|
- All allocation occurs using a bump-allocator: the heap pointer is
|
|
|
simply incremented by the number of bytes desired (subject to to a
|
|
|
check that this does not exhaust available memory). The garbage
|
|
|
collector is responsible for moving objects out of the area of the
|
|
|
heap managed by the bump allocator and into the care of its
|
|
|
generational collector.
|
|
|
- The last address in the bump-allocated part of the heap that has
|
|
|
been used is pointed to by the STG register `Hp`, with `HpLim`
|
|
|
holding the maximum address available for bump-allocation.
|
|
|
|
|
|
### Important concepts in the machine
|
|
|
|
|
|
### Important concepts in the machine
|
|
|
|
|
|
Some of the key concepts in the STG machine include *closures*,
|
|
|
*info tables* and *entry code*. We tackle them in reverse
|
... | ... | @@ -193,7 +184,6 @@ Essentially a heap-allocated pair of the free variables of some |
|
|
code, and a pointer to its info table (i.e. its info pointer).
|
|
|
</td></tr></table>
|
|
|
|
|
|
|
|
|
For an example of how these parts work together, consider the
|
|
|
following code
|
|
|
|
... | ... | @@ -201,10 +191,8 @@ following code |
|
|
my_fun x zs = map (\y -> y + x) zs
|
|
|
```
|
|
|
|
|
|
|
|
|
The nested lambda will give rise to all of the above objects.
|
|
|
|
|
|
|
|
|
The closure will store a pointer to `x`'s closure (as it is a free
|
|
|
variable of the lambda), along with a pointer to an info table.
|
|
|
That info table will contain information relevant to a function
|
... | ... | @@ -215,50 +203,44 @@ implement the addition by combining the closure for the free |
|
|
variable `x` (taken from the closure) with the stack-passed `y`
|
|
|
variable's closure.
|
|
|
|
|
|
|
|
|
Upon entry to some code, pointers to closures are made available in
|
|
|
`R1`. That is to say, before entry code is jumped to, `R1` is set
|
|
|
up to point to the associated closure, so that the entry code can
|
|
|
access free variables (if any).
|
|
|
|
|
|
|
|
|
Closures for code which contain no free variables (such as the
|
|
|
closure for `True` and `False`, and functions applied to no
|
|
|
arguments such as `(:)` and `id`) are allocated statically by the
|
|
|
compiler in the same manner as info tables are.
|
|
|
|
|
|
### Overview of execution model of the machine
|
|
|
|
|
|
### Overview of execution model of the machine
|
|
|
|
|
|
This will be covered in more detail in the examples below, so I
|
|
|
will use this section to make some general points.
|
|
|
|
|
|
|
|
|
The goal of the STG machine is to reduce the current expression to
|
|
|
a value. When it has done so, it:
|
|
|
|
|
|
1. Stores a tagged pointer to evaluated closure in the STG register
|
|
|
`R1`
|
|
|
1. Jumps to the entry code of the info table pointed to by the
|
|
|
value at the top of the STG stack
|
|
|
|
|
|
- This may also be called the info table of the *continuation* of
|
|
|
the expression
|
|
|
|
|
|
value at the top of the STG stack
|
|
|
- This may also be called the info table of the *continuation* of
|
|
|
the expression
|
|
|
|
|
|
The continuation code is responsible for popping its info pointer
|
|
|
(and stack-allocated free variables, if any) from the stack before
|
|
|
returning.
|
|
|
|
|
|
|
|
|
Arguments are passed on the stack, and are popped by the callee.
|
|
|
Upon a jump to the entry code for a function, there are always
|
|
|
precisely as many arguments on the stack as the (statically known)
|
|
|
arity of that function, and those arguments will be followed by the
|
|
|
info pointer of a continuation.
|
|
|
|
|
|
## Saturated application to known functions
|
|
|
|
|
|
## Saturated application to known functions
|
|
|
|
|
|
Handling application in the STG machine is a big topic, and so in
|
|
|
this first section we only look at the case of *saturated*
|
... | ... | @@ -266,8 +248,8 @@ applications to *known* functions - i.e. those functions that the |
|
|
compiler statically knows information such as the entry code
|
|
|
pointer and arity for.
|
|
|
|
|
|
### Example 1: function application with sufficient stack space
|
|
|
|
|
|
### Example 1: function application with sufficient stack space
|
|
|
|
|
|
Application of functions is the bread and butter of the STG
|
|
|
machine. Correspondingly, this first Haskell program
|
... | ... | @@ -279,7 +261,6 @@ known_fun x = x |
|
|
known_app::()->Intknown_app_= known_fun 10
|
|
|
```
|
|
|
|
|
|
|
|
|
compiles to very simple C-- code
|
|
|
|
|
|
```
|
... | ... | @@ -292,14 +273,12 @@ Main_knownzuapp_entry(){cl3: |
|
|
_Main_knownzuapp_entry:Lcl3:movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)jmp_Main_knownzufun_entry
|
|
|
```
|
|
|
|
|
|
|
|
|
The STG machine passes arguments to functions on the STG stack, and
|
|
|
a pointer to the stack top is stored in the STG register `Sp`.
|
|
|
Furthermore, because GHC currently uses the eval/apply variant of
|
|
|
the STG machine, exactly as many arguments as the function expects
|
|
|
to receive are guaranteed to present on the stack.
|
|
|
|
|
|
|
|
|
Therefore, upon entry to the `known_app` function, we are
|
|
|
guaranteed that the STG stack has a pointer to a closure of type
|
|
|
`()` on top of it. In order to call `known_fun`, we just modify the
|
... | ... | @@ -307,8 +286,8 @@ top of the stack to replace that pointer with a pointer to the |
|
|
statically allocated closure for the literal `10`, and then
|
|
|
tail-call into the entry code of `known_fun`.
|
|
|
|
|
|
### Example 2: function application that needs to grow the stack
|
|
|
|
|
|
### Example 2: function application that needs to grow the stack
|
|
|
|
|
|
This Haskell code is apparently little more complicated than the
|
|
|
previous example
|
... | ... | @@ -320,7 +299,6 @@ known_fun_2 x _= x |
|
|
known_app_2::()->Intknown_app_2_= known_fun_2 1010
|
|
|
```
|
|
|
|
|
|
|
|
|
however, it generates radically different C-- code:
|
|
|
|
|
|
```
|
... | ... | @@ -337,15 +315,14 @@ Main_knownzuappzu2_entry(){clE:if(Sp -4< SpLim)goto clH; |
|
|
_Main_knownzuappzu2_entry:LclE:leal-4(%ebp),%eaxcmpl84(%ebx),%eaxjbLclHmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,-4(%ebp)addl$-4,%ebpjmp_Main_knownzufunzu2_entryLclH:movl$_Main_knownzuappzu2_closure,%esijmp*-4(%ebx)
|
|
|
```
|
|
|
|
|
|
|
|
|
As before, upon entry the STG stack is guaranteed to have a single
|
|
|
closure pointer at its top. However, in order to call into
|
|
|
known_fun_2 we need at least two free stack slots at the top for
|
|
|
arguments, which means that we have to grow the stack by one word
|
|
|
before we can make the call.
|
|
|
|
|
|
#### Checking for sufficient stack space
|
|
|
|
|
|
#### Checking for sufficient stack space
|
|
|
|
|
|
First, we check to see if growing the stack would overflow
|
|
|
allocated stack space, by comparing the STG stack pointer register
|
... | ... | @@ -355,7 +332,6 @@ allocated stack space, by comparing the STG stack pointer register |
|
|
if(Sp -4< SpLim)goto clH;
|
|
|
```
|
|
|
|
|
|
|
|
|
(The stack grows downwards, hence the *subtraction* of 4 from the
|
|
|
current `Sp`). If the stack check fails, we branch to `clH`:
|
|
|
|
... | ... | @@ -365,7 +341,6 @@ clH: |
|
|
jump stg_gc_fun();
|
|
|
```
|
|
|
|
|
|
|
|
|
This stores the closure of the current function in `R1`, and then
|
|
|
jumps into the hand-written garbage collector code to force it to
|
|
|
grow the stack. After the stack has been grown, the collector will
|
... | ... | @@ -374,8 +349,8 @@ stored in the (statically-allocated) `Main_knownzuappzu2_closure` |
|
|
closure pointed to by `R1`, and the stack check will be run again -
|
|
|
hopefully succeeding this time!
|
|
|
|
|
|
#### Making the known call
|
|
|
|
|
|
#### Making the known call
|
|
|
|
|
|
Given that the stack check succeeds, it is easy to make the actual
|
|
|
call we are after. We simply grow the stack by the required amount,
|
... | ... | @@ -389,7 +364,6 @@ course): |
|
|
Sp = Sp -4;
|
|
|
```
|
|
|
|
|
|
|
|
|
A simple tail call to the new function finishes us off:
|
|
|
|
|
|
```
|
... | ... | @@ -398,7 +372,6 @@ A simple tail call to the new function finishes us off: |
|
|
|
|
|
## Example 3: Unsaturated applications to known functions
|
|
|
|
|
|
|
|
|
Despite describing an undersaturated call, this Haskell code
|
|
|
|
|
|
```
|
... | ... | @@ -408,7 +381,6 @@ known_fun_2 x _= x |
|
|
known_undersaturated_app::()->Int->Intknown_undersaturated_app_= known_fun_2 10
|
|
|
```
|
|
|
|
|
|
|
|
|
compiles to straightforward C-- as follows
|
|
|
|
|
|
```
|
... | ... | @@ -421,15 +393,14 @@ Main_knownzuundersaturatedzuapp_entry(){cmd: |
|
|
_Main_knownzuundersaturatedzuapp_entry:Lcmd:movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)jmp_Main_knownzufunzu2_entry
|
|
|
```
|
|
|
|
|
|
|
|
|
The reason that there is no special magic to deal with
|
|
|
undersaturated applications to known functions is simple: GHC
|
|
|
simply gives `known_undersaturated_app` an arity of 2, so by the
|
|
|
time we jump to the entry code the stack must already contain any
|
|
|
arguments required by `known_fun_2`.
|
|
|
|
|
|
## Example 4: Applications to unknown functions
|
|
|
|
|
|
## Example 4: Applications to unknown functions
|
|
|
|
|
|
We aren't going to tackle oversaturated calls to known functions
|
|
|
until we've considered happens to calls to statically-unknown
|
... | ... | @@ -440,7 +411,6 @@ following Haskell code |
|
|
unknown_app::(Int->Int)->Int->Intunknown_app f x = f x
|
|
|
```
|
|
|
|
|
|
|
|
|
Which compiles to this C-- function
|
|
|
|
|
|
```
|
... | ... | @@ -454,7 +424,6 @@ Main_unknownzuapp_entry(){cnO: |
|
|
_Main_unknownzuapp_entry:Lcn0:movl(%ebp),%esiaddl$4,%ebpjmp_stg_ap_p_fast
|
|
|
```
|
|
|
|
|
|
|
|
|
Unlike the previous cases we have looked at, we are compiling an
|
|
|
application where we don't statically know either the arity or the
|
|
|
info pointer of the function being applied. To deal with such
|
... | ... | @@ -462,43 +431,41 @@ cases, the STG machine uses several pre-compiled "generic apply" |
|
|
functions which inspect the info-table for the function in question
|
|
|
and decide how the available arguments should be applied to it.
|
|
|
|
|
|
### Dealing with generic application
|
|
|
|
|
|
### Dealing with generic application
|
|
|
|
|
|
There are three cases the generic apply functions have to deal
|
|
|
with:
|
|
|
|
|
|
1. The function's arity (recorder in the function closure's info
|
|
|
table) exactly matches the number of arguments available on the
|
|
|
stack
|
|
|
table) exactly matches the number of arguments available on the
|
|
|
stack
|
|
|
|
|
|
- This is the best case. In this case, the generic apply function
|
|
|
simply makes a tail call into the function's entry code
|
|
|
- This is the best case. In this case, the generic apply function
|
|
|
simply makes a tail call into the function's entry code
|
|
|
|
|
|
1. The function's arity is greater than the number of arguments
|
|
|
available on the stack
|
|
|
available on the stack
|
|
|
|
|
|
- In this case, the generic apply code allocates a PAP (partial
|
|
|
application) closure which closes over both the new arguments and
|
|
|
the function pointer, and returns that value, in the normal STGish
|
|
|
way, to the continuation on the top of the stack
|
|
|
- In this case, the generic apply code allocates a PAP (partial
|
|
|
application) closure which closes over both the new arguments and
|
|
|
the function pointer, and returns that value, in the normal STGish
|
|
|
way, to the continuation on the top of the stack
|
|
|
|
|
|
1. The function's arity is less than the number of arguments
|
|
|
available on the stack
|
|
|
|
|
|
- In this case, a number of arguments matching the arity are pushed
|
|
|
on top of the stack, followed by a continuation which uses another
|
|
|
of the generic apply functions to apply the remaining arguments.
|
|
|
The code for the original function is then entered
|
|
|
- Eventually the code for the continuation is entered and another
|
|
|
generic apply function will be tail-called to deal with the
|
|
|
result
|
|
|
available on the stack
|
|
|
|
|
|
- In this case, a number of arguments matching the arity are pushed
|
|
|
on top of the stack, followed by a continuation which uses another
|
|
|
of the generic apply functions to apply the remaining arguments.
|
|
|
The code for the original function is then entered
|
|
|
- Eventually the code for the continuation is entered and another
|
|
|
generic apply function will be tail-called to deal with the
|
|
|
result
|
|
|
|
|
|
Potentially, one generic apply function is required for every
|
|
|
"argument pattern". Some example argument patterns are:
|
|
|
|
|
|
|
|
|
Because the number of patterns is large (actually unbounded,
|
|
|
because functions might be of any arity), GHC only generates
|
|
|
generic apply functions for enough patterns so that 99.9% of all
|
... | ... | @@ -508,8 +475,8 @@ chaining together several smaller generic apply functions, in a |
|
|
similar manner as when dealing with oversaturated function
|
|
|
applications.
|
|
|
|
|
|
### Making the call to the generic application code
|
|
|
|
|
|
### Making the call to the generic application code
|
|
|
|
|
|
Let's remind ourselves of the original code:
|
|
|
|
... | ... | @@ -519,15 +486,14 @@ Let's remind ourselves of the original code: |
|
|
jump stg_ap_p_fast();
|
|
|
```
|
|
|
|
|
|
|
|
|
Knowing about generic apply functions, the call itself is easy to
|
|
|
understand. We pop the top of the stack (the function argument)
|
|
|
into `R1` and then jump into the generic application code for the
|
|
|
case where the stack contains a single pointer argument, which
|
|
|
deals with all the cases for `f` described above.
|
|
|
|
|
|
## Example 5: oversaturated applications to known functions
|
|
|
|
|
|
## Example 5: oversaturated applications to known functions
|
|
|
|
|
|
This Haskell code
|
|
|
|
... | ... | @@ -538,7 +504,6 @@ known_fun_2 x _= x |
|
|
known_oversat_app::()->Intknown_oversat_app_= known_fun_2 id id 10
|
|
|
```
|
|
|
|
|
|
|
|
|
compiles to the following C-- function
|
|
|
|
|
|
```
|
... | ... | @@ -557,12 +522,10 @@ Main_knownzuoversatzuapp_entry(){cmj:if(Sp -12< SpLim)goto cmm; |
|
|
_Main_knownzuoversatzuapp_entry:Lcmj:leal-12(%ebp),%eaxcmpl84(%ebx),%eaxjbLcmmmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)movlL_stg_ap_p_info$non_lazy_ptr,%eaxmovl%eax,-4(%ebp)movl$_base_GHCziBase_id_closure,-8(%ebp)movl$_base_GHCziBase_id_closure,-12(%ebp)addl$-12,%ebpjmp_Main_knownzufunzu2_entryLcmm:movl$_Main_knownzuoversatzuapp_closure,%esijmp*-4(%ebx)
|
|
|
```
|
|
|
|
|
|
|
|
|
As you might see, despite being a call to a known function, this
|
|
|
code makes use of the generic apply functions we discussed in the
|
|
|
last section. Let's pick the function apart and see how it works.
|
|
|
|
|
|
|
|
|
First, we do the usual stack check. What differs from the last time
|
|
|
we saw this check is that we are not only allocating space for
|
|
|
arguments on the stack, but also for a *continuation*. We set up
|
... | ... | @@ -576,7 +539,6 @@ these new stack entries as follows: |
|
|
Sp = Sp -12;
|
|
|
```
|
|
|
|
|
|
|
|
|
i.e. the final stack looks as follows (note that the code
|
|
|
overwrites the old pointer to a closure of type ()):
|
|
|
|
... | ... | @@ -595,14 +557,12 @@ overwrites the old pointer to a closure of type ()): |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
Because `known_fun_2` is of arity 2, when we jump to its entry
|
|
|
code, it will only consume the top two arguments from the stack:
|
|
|
i.e. the two pointers to `base_GHCziBase_id_closure`. It will then
|
|
|
evaluate to some sort of value and transfer control to the entry
|
|
|
code for `stg_ap_p_info`.
|
|
|
|
|
|
|
|
|
This is where the magic happens: the entry code for `stg_ap_p_info`
|
|
|
will apply the function value that was returned from `known_fun_2`
|
|
|
to the (pointer) argument in the "free variable" of its (stack
|
... | ... | @@ -612,12 +572,11 @@ allocated) closure -- and we have arranged that that is |
|
|
for calls to unknown functions, so this will make use of the
|
|
|
`stg_ap_p_fast` function we saw before.
|
|
|
|
|
|
|
|
|
Finally, control will be transferred back to the caller for
|
|
|
`known_oversat_app`, and all will be well.
|
|
|
|
|
|
## Example 6: allocation of thunks and data
|
|
|
|
|
|
## Example 6: allocation of thunks and data
|
|
|
|
|
|
Something that happens all the time in Haskell is allocation. There
|
|
|
are three principal types of thing that get allocated: function
|
... | ... | @@ -640,7 +599,6 @@ common characteristics: |
|
|
- Info-tables containing various miscellaneous metadata about the
|
|
|
heap object, such as function arity
|
|
|
|
|
|
|
|
|
Let us look at how a thunk and a data constructor get allocated in
|
|
|
a simple setting:
|
|
|
|
... | ... | @@ -648,7 +606,6 @@ a simple setting: |
|
|
build_data::Int->MaybeIntbuild_data x =Just(x +1)
|
|
|
```
|
|
|
|
|
|
|
|
|
This compiles into the following C--:
|
|
|
|
|
|
```
|
... | ... | @@ -670,18 +627,16 @@ Main_buildzudata_entry(){clE: |
|
|
_Main_buildzudata_entry:LclE:addl$20,%edicmpl92(%ebx),%edijaLclHmovl$_slk_info,-16(%edi)movl(%ebp),%eaxmovl%eax,-8(%edi)movl$_base_DataziMaybe_Just_con_info,-4(%edi)leal-16(%edi),%eaxmovl%eax,(%edi)leal-2(%edi),%esiaddl$4,%ebpmovl(%ebp),%eaxjmp*(%eax)LclH:movl$20,112(%ebx)LclI:movl$_Main_buildzudata_closure,%esijmp*-4(%ebx)
|
|
|
```
|
|
|
|
|
|
|
|
|
Let's break this function down slowly.
|
|
|
|
|
|
### Checking for sufficient heap space
|
|
|
|
|
|
### Checking for sufficient heap space
|
|
|
|
|
|
Any function that needs to allocate memory might find that the heap
|
|
|
has been exhausted. If that happens, it needs to call into the
|
|
|
garbage collector in order to get the heap cleaned up and
|
|
|
(possibly) enlarged.
|
|
|
|
|
|
|
|
|
Hence, the first thing any such function does is check to see if
|
|
|
enough memory is available for its purposes:
|
|
|
|
... | ... | @@ -693,14 +648,12 @@ clE: |
|
|
HpAlloc =20;goto clI;
|
|
|
```
|
|
|
|
|
|
|
|
|
This is simple enough. The function needs to allocate 20 bytes (the
|
|
|
data constructor takes up 2 words, and the thunk will take up 3),
|
|
|
so it speculatively increments Hp and then checks the STG registers
|
|
|
`Hp` and `HpLim` (the pointer to the top of the available heap
|
|
|
space) against each other.
|
|
|
|
|
|
|
|
|
If memory is insufficient (i.e. we have moved `Hp` past the top of
|
|
|
the available heap), the code deals with it by setting the
|
|
|
`HpAlloc` register to the number of bytes needed and `R1` to the
|
... | ... | @@ -710,7 +663,6 @@ cleanup. The garbage collector will resume execution of the code by |
|
|
using the information from `R1`, after it has freed up enough
|
|
|
memory.
|
|
|
|
|
|
|
|
|
Side note: I believe that the line setting `R1` is unnecessary
|
|
|
here, because `R1` should anyway always be set to the address of
|
|
|
the closure when executing the closure entry code. I could be
|
... | ... | @@ -718,7 +670,6 @@ wrong, though. |
|
|
|
|
|
### Performing the actual allocation
|
|
|
|
|
|
|
|
|
Once the heap check succeeds, we will be able to enter the body of
|
|
|
the function proper. Since the `Hp` has already been incremented,
|
|
|
we can just construct the new heap objects directly:
|
... | ... | @@ -730,7 +681,6 @@ we can just construct the new heap objects directly: |
|
|
I32[Hp +0]= Hp -16;
|
|
|
```
|
|
|
|
|
|
|
|
|
So we get something like this:
|
|
|
|
|
|
```wiki
|
... | ... | @@ -750,12 +700,11 @@ So we get something like this: |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
The bottom two words are the allocated `Just` value, and the three
|
|
|
above that correspond to the `x + 1` closure.
|
|
|
|
|
|
### Returning an allocated value to the caller
|
|
|
|
|
|
### Returning an allocated value to the caller
|
|
|
|
|
|
Now that we have allocated the data we entered the function in
|
|
|
order to construct, we need to return it to the caller. This is
|
... | ... | @@ -767,49 +716,41 @@ achieved by the following code: |
|
|
jump (I32[I32[Sp +0]])();
|
|
|
```
|
|
|
|
|
|
|
|
|
To return, the STG machine:
|
|
|
|
|
|
1. Sets `R1` to the pointer to the result of evaluation
|
|
|
1. Pops all the arguments to the function from the stack
|
|
|
1. Jumps to the entry code for the continuation. This is always
|
|
|
found at the top of the STG stack, logically below any arguments
|
|
|
that were pushed to make the call.
|
|
|
|
|
|
found at the top of the STG stack, logically below any arguments
|
|
|
that were pushed to make the call.
|
|
|
|
|
|
This is indeed exactly what happens here, with two interesting
|
|
|
points: pointer tagging, and the double-deference of the stack
|
|
|
pointer. These will be discussed in the next two subsections.
|
|
|
|
|
|
#### Pointer tagging
|
|
|
|
|
|
#### Pointer tagging
|
|
|
|
|
|
One exciting feature is that the code setting `R1`, i.e.
|
|
|
`R1 = Hp - 2`. This is setting `R1` to point to the `Just`, we just
|
|
|
allocated, but simultaneously tagging that pointer with the value
|
|
|
|
|
|
1. The fact that the tag is non-zero indicates to users of the
|
|
|
|
|
|
|
|
|
allocated, but simultaneously tagging that pointer with the
|
|
|
value 1. The fact that the tag is non-zero indicates to users of the
|
|
|
pointer that the thing pointed to is already evaluated.
|
|
|
Furthermore, because `Maybe` has only two constructors, we are able
|
|
|
to use the pointer tags to record which constructor it evaluated
|
|
|
to: in this case, the 2 indicates the `Just` constructor.
|
|
|
|
|
|
|
|
|
It is compulsory to tag pointers before jumping to the address of
|
|
|
the continuation entry code: the entry code can and will rely on
|
|
|
those tags being present!
|
|
|
|
|
|
#### `TABLES_NEXT_TO_CODE`
|
|
|
|
|
|
|
|
|
Because I have compiled GHC without `TABLES_NEXT_TO_CODE`, the
|
|
|
entry code for the continuation is found by dereferencing the
|
|
|
pointer to the info table we found at the top of the STG stack -
|
|
|
i.e. a double-dereference.
|
|
|
|
|
|
|
|
|
The layout of heap objects without `TABLES_NEXT_TO_CODE` is as
|
|
|
follows:
|
|
|
|
... | ... | @@ -823,7 +764,6 @@ follows: |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
With `TABLES_NEXT_TO_CODE` on, the situation looks more like this:
|
|
|
|
|
|
```wiki
|
... | ... | @@ -836,7 +776,6 @@ With `TABLES_NEXT_TO_CODE` on, the situation looks more like this: |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
The `TABLES_NEXT_TO_CODE` optimisation removes the need for that
|
|
|
second dereference during the return, because the entry code is
|
|
|
always right next to the info table. However, it requires special
|
... | ... | @@ -844,8 +783,8 @@ support from the backend for ensuring that data (i.e. the info |
|
|
table) and code are contiguous in memory, so it cannot always be
|
|
|
used.
|
|
|
|
|
|
## Example 7: `case` expressions
|
|
|
|
|
|
## Example 7: `case` expressions
|
|
|
|
|
|
Let us now examine how `case` expressions are handled. Compiling
|
|
|
the following Haskell
|
... | ... | @@ -855,7 +794,6 @@ case_scrut :: Maybe Int -> Int |
|
|
case_scrut x = case x of Just x -> x; Nothing -> 10
|
|
|
```
|
|
|
|
|
|
|
|
|
Produces this C-- code
|
|
|
|
|
|
```
|
... | ... | @@ -880,15 +818,14 @@ scj_ret(){cct: |
|
|
_Main_casezuscrut_entry:Lccx:movl(%ebp),%esimovl$_scj_info,(%ebp)testl$3,%esijneLccAmovl(%esi),%eaxjmp*(%eax)LccA:jmp*_scj_info_scj_ret:Lcct:movl%esi,%eaxandl$3,%eaxcmpl$2,%eaxjaeLccvmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxleal209(%eax),%esiaddl$4,%ebpmovl(%ebp),%eaxjmp*(%eax)Lccv:movl2(%esi),%esiaddl$4,%ebpandl$-4,%esimovl(%esi),%eaxjmp*(%eax)
|
|
|
```
|
|
|
|
|
|
|
|
|
Notice that GHC has generated *two* functions:
|
|
|
`Main_casezuscrut_entry` and `scj_ret` correspond to the code for
|
|
|
forcing the argument to the `case`, and for the *continuation* of
|
|
|
the `case` respectively. Let's pick them apart and see how they
|
|
|
work!
|
|
|
|
|
|
### Forcing the scrutinee of the `case`
|
|
|
|
|
|
### Forcing the scrutinee of the `case`
|
|
|
|
|
|
When we first call the `case_scrut` function, its entry code begins
|
|
|
executing:
|
... | ... | @@ -901,7 +838,6 @@ ccx: |
|
|
jump (I32[scj_info])();
|
|
|
```
|
|
|
|
|
|
|
|
|
This is a function of arity 1 (i.e. with a single argument), so
|
|
|
upon entry the machine state looks like this:
|
|
|
|
... | ... | @@ -918,13 +854,11 @@ upon entry the machine state looks like this: |
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
Because this is a top level function, the closure is statically
|
|
|
allocated and contains no free variables. However, as discussed
|
|
|
previously, the single argument to the function is guaranteed to be
|
|
|
present at the top of the stack.
|
|
|
|
|
|
|
|
|
The code starts off by saving this argument (the `x`) temporarily
|
|
|
into `R1`:
|
|
|
|
... | ... | @@ -933,7 +867,6 @@ ccx: |
|
|
R1 = I32[Sp +0];
|
|
|
```
|
|
|
|
|
|
|
|
|
The next thing the code does is overwrites this argument on the
|
|
|
stack with a pointer to the info-table of the continuation code.
|
|
|
This is the code that will be invoked after `x` has been evaluated
|
... | ... | @@ -944,7 +877,6 @@ as the `Nothing` or as the `Just` branch of the case: |
|
|
I32[Sp +0]= scj_info;
|
|
|
```
|
|
|
|
|
|
|
|
|
As we saw earlier, any time that the STG machine decides that it
|
|
|
has a value in its hand, it will continue evaluation by
|
|
|
tail-calling the entry code found by dereferencing the info-table
|
... | ... | @@ -952,13 +884,11 @@ pointer at the top of the stack. So by putting the address of our |
|
|
continuation in here, we ensure that the entry code for `scj_info`
|
|
|
is executed after `x` becomes a value.
|
|
|
|
|
|
|
|
|
Now, what we need to do is to start the evaluation of `x`. We could
|
|
|
just jump into `x`'s entry code and hope for the best, but thanks
|
|
|
to GHC's pointer tagging we can sometimes avoid doing this indirect
|
|
|
branch.
|
|
|
|
|
|
|
|
|
So, instead, we test to see if the `x` pointer has a tag. If it is
|
|
|
tagged, then we know that it is already evaluated and hence jump
|
|
|
directly to the code for the continuation. If it is not tagged, we
|
... | ... | @@ -971,14 +901,12 @@ if(R1 &3!=0)goto ccA; |
|
|
jump (I32[scj_info])();
|
|
|
```
|
|
|
|
|
|
|
|
|
Note the test `R1 & 3 != 0`: this reflects the fact that pointer
|
|
|
tags are stored in the lower 2 bits of the pointer on 32 bit
|
|
|
machines. Another interesting feature is how the `jump`
|
|
|
instructions find the entry code: again, we see a deference of the
|
|
|
info pointer because `TABLES_NEXT_TO_CODE` is turned off.
|
|
|
|
|
|
|
|
|
As we saw, the `case` scrutinisation code ended with one of two
|
|
|
things happening: 1. A direct call into the continuation code
|
|
|
`scj_ret` if the scrutinee was already evaluated 2. A call into the
|
... | ... | @@ -988,12 +916,11 @@ with that information) - Because we pushed `scj_info` onto the STG |
|
|
stack, control will eventually return to `scj_ret` after the
|
|
|
evaluation of `x` has finished
|
|
|
|
|
|
|
|
|
It is now time to examine the continuation code to see what happens
|
|
|
after `x` becomes a value.
|
|
|
|
|
|
### Dealing with the forced scrutinee
|
|
|
|
|
|
### Dealing with the forced scrutinee
|
|
|
|
|
|
The continuation code is a little more complicated:
|
|
|
|
... | ... | @@ -1009,7 +936,6 @@ cct: |
|
|
jump (I32[I32[R1]])();
|
|
|
```
|
|
|
|
|
|
|
|
|
Whenever the STG machine evaluates to a value it will return the
|
|
|
value by jumping to the entry point at the top of the stack. In
|
|
|
this case, `R1` is guaranteed to be a (tagged) pointer to the thing
|
... | ... | @@ -1030,7 +956,6 @@ constructor it was evaluated to. In this case, we would have to |
|
|
read the constructor tag by dereferencing `R1` and testing the
|
|
|
resulting info table pointer against all possibilities.
|
|
|
|
|
|
|
|
|
If the tag was greater than or equal to 2, we go to the `ccv`
|
|
|
branch, which deals with what happens if we had a `Just`. In this
|
|
|
case, we need to continue by forcing the thunk inside the `Just`
|
... | ... | @@ -1045,7 +970,6 @@ ccv: |
|
|
jump (I32[I32[R1]])();
|
|
|
```
|
|
|
|
|
|
|
|
|
To access the thing inside the `Just`, the code assumes that the
|
|
|
`R1` pointer is tagged with the 2 that indicates a `Just`
|
|
|
constructor, and hence finds the first free variable (stored 4
|
... | ... | @@ -1056,14 +980,13 @@ the stack in `Main_casezuscrut_entry` by moving `Sp` up 4 bytes |
|
|
jumps into the entry code for the `R1` thunk, using the same
|
|
|
double-dereference pattern discussed earlier.
|
|
|
|
|
|
|
|
|
There seems to be a small missed opportunity here: the code could
|
|
|
check the pointer tag on `R1`, and then return directly if it is
|
|
|
set. I imagine that this isn't being done in order to reduce
|
|
|
possible code bloat.
|
|
|
|
|
|
## Example 8: thunks and thunk update
|
|
|
|
|
|
## Example 8: thunks and thunk update
|
|
|
|
|
|
You might be wondering how the `x + 1` thunk we saw allocated in a
|
|
|
previous section will behave when it is actually forced. To remind
|
... | ... | @@ -1074,7 +997,6 @@ code: |
|
|
build_data::Int->MaybeIntbuild_data x =Just(x +1)
|
|
|
```
|
|
|
|
|
|
|
|
|
So how does the `x + 1` thunk work? An excellent question! Let's
|
|
|
take a look at the C-- for its entry code and find out:
|
|
|
|
... | ... | @@ -1102,7 +1024,6 @@ soN_ret(){cp7: |
|
|
_sst_entry:Lcph:leal-12(%ebp),%eaxcmpl84(%ebx),%eaxjbLcpjmovlL_stg_upd_frame_info$non_lazy_ptr,%eaxmovl%eax,-8(%ebp)movl%esi,-4(%ebp)movl8(%esi),%esimovl$_soN_info,-12(%ebp)addl$-12,%ebptestl$3,%esijneLcpkmovl(%esi),%eaxjmp*(%eax)Lcpj:jmp*-8(%ebx)Lcpk:jmp*_soN_info_soN_ret:Lcp7:addl$8,%edicmpl92(%ebx),%edijaLcpcmovl3(%esi),%eaxincl%eaxmovl$_ghczmprim_GHCziTypes_Izh_con_info,-4(%edi)movl%eax,(%edi)leal-3(%edi),%esiaddl$4,%ebpmovlL_stg_upd_frame_info$non_lazy_ptr,%eaxjmp*(%eax)Lcpc:movl$8,112(%ebx)Lcpd:jmp*-8(%ebx)
|
|
|
```
|
|
|
|
|
|
|
|
|
The original Haskell code read `x + 1`, but GHC has inlined the
|
|
|
actual code for the addition operation on `Int`s, which looks
|
|
|
something like:
|
... | ... | @@ -1111,15 +1032,14 @@ something like: |
|
|
plusInt(I# a)(I# b)=I#(a + b)
|
|
|
```
|
|
|
|
|
|
|
|
|
The second pattern match (to get `b`) has been performed statically
|
|
|
by GHC, obtaining the machine literal 1, which shows up directly in
|
|
|
the generated code. Therefore, the code only need to evaluate and
|
|
|
case-decompose the unknown free variable `x` of our closure, to get
|
|
|
the `a` argument to `plusInt`.
|
|
|
|
|
|
### Thunk entry point
|
|
|
|
|
|
### Thunk entry point
|
|
|
|
|
|
This evaluation is what is being done by the thunk entry code
|
|
|
`slk_entry`. Ignoring the stack check, the C-- begins thusly:
|
... | ... | @@ -1132,7 +1052,6 @@ This evaluation is what is being done by the thunk entry code |
|
|
Sp = Sp -12;
|
|
|
```
|
|
|
|
|
|
|
|
|
Remembering that upon entry to the thunk entry code, `R1` points to
|
|
|
the thunk's closure, the new stack looks as follows:
|
|
|
|
... | ... | @@ -1151,12 +1070,10 @@ the thunk's closure, the new stack looks as follows: |
|
|
+-------------+
|
|
|
```
|
|
|
|
|
|
|
|
|
The C-- statement `R1 = I32[R1 + 8]` is pulling out the pointer to
|
|
|
the free variable of the thunk (which was set up in
|
|
|
`Main_buildzudata_entry`) into `R1`.
|
|
|
|
|
|
|
|
|
Finally, the entry code evaluates that free variable (checking the
|
|
|
tag bits of the pointer first, as usual):
|
|
|
|
... | ... | @@ -1165,18 +1082,15 @@ if(R1 &3!=0)goto cpk; |
|
|
jump (I32[I32[R1]])();cpk: jump (I32[soN_info])();
|
|
|
```
|
|
|
|
|
|
|
|
|
Because we put `soN_info` at the top of the stack, when evaluation
|
|
|
of `x` is complete the STG machine will continue by executing the
|
|
|
`soN_ret` code.
|
|
|
|
|
|
|
|
|
The most interesting feature of this code is the extra stuff that
|
|
|
has been pushed onto the stack below `soN_ret`: an info pointer
|
|
|
called `stg_upd_frame_info`, and a pointer to the thunk currently
|
|
|
being evaluated.
|
|
|
|
|
|
|
|
|
This is all part of the STG machine's thunk update mechanism. When
|
|
|
the `soN_ret` continuation returns, it will transfer control
|
|
|
*not* to the code forcing the thunk, but to some code which
|
... | ... | @@ -1186,18 +1100,16 @@ indirection closure is trivial: it immediately returns a pointer to |
|
|
the thing that was returned from the `soN_ret` continuation in
|
|
|
`R1`.
|
|
|
|
|
|
|
|
|
These indirections are the mechanism which ensures that the STG
|
|
|
machine never repeats the work of evaluating a thunk more than
|
|
|
once: after the first evaluation, any code forcing the thunk jumps
|
|
|
into the indirection entry code rather than `slk_entry`.
|
|
|
|
|
|
|
|
|
That being said, let us look at how the continuation responsible
|
|
|
for actually finding the value of `x + 1` works:
|
|
|
|
|
|
### Continuation of the thunk
|
|
|
|
|
|
### Continuation of the thunk
|
|
|
|
|
|
Upon entry to the continuation code, we have the evaluated `x` in
|
|
|
`R1`: it now needs to do the addition and allocate a `I#`
|
... | ... | @@ -1214,7 +1126,6 @@ check, we have the following code: |
|
|
jump (I32[stg_upd_frame_info])();
|
|
|
```
|
|
|
|
|
|
|
|
|
This is mostly standard stuff. Because the `R1` pointer is
|
|
|
guaranteed tagged, and there is only one possible constructor, the
|
|
|
tag must be 1 and so the `Int#` value inside the `Int` is pulled
|
... | ... | @@ -1222,15 +1133,14 @@ out using `I32[R1 + 3]`. This is then put into a newly |
|
|
heap-allocated `I#` constructor, which is returned in `R1` after we
|
|
|
pop the `soN_info` pointer from the stack.
|
|
|
|
|
|
|
|
|
The only interesting point is where we return to: rather than
|
|
|
dereference `Sp` to find the info pointer at the top of the STG
|
|
|
stack, GHC has generated code that takes advantage of the fact that
|
|
|
the `Sp` is guaranteed to point to `stg_upd_frame_info`. This
|
|
|
avoids one pointer dereference.
|
|
|
|
|
|
## Conclusion
|
|
|
|
|
|
## Conclusion
|
|
|
|
|
|
This document has left much of the detail of how STG is implemented
|
|
|
out: notable omissions include CAFs, and the precise behaviour of
|
... | ... | |