KevinBuhr · 0eb9a183
--- a/commentary/compiler/generated-code.md
+++ b/commentary/compiler/generated-code.md
 # I know kung fu: learning STG by example

-
 The STG machine is an essential part of GHC, the world's leading
 Haskell compiler. It defines how the Haskell evaluation model
 should be efficiently implemented on standard hardware. Despite
@@ -10,8 +9,8 @@ its modern, eval/apply-based, pointer-tagged incarnation by a
 series of simple examples showing how Haskell source code is
 compiled.

-## What is STG, exactly?

+## What is STG, exactly?

 Haskell code being sucked through GHC has a complex lifecycle.
 Broadly speaking, it transitions between five representations:
@@ -32,43 +31,37 @@ Broadly speaking, it transitions between five representations:

 ```

-
 The path from C-- to assembly varies: the three possible backends
 are C (`-fvia-c` (**[OUTDATED](https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/codegens.html#c-code-generator-fvia-c)**)), LLVM (`-fllvm`), and the default backend -- the
 native code genarator (or NCG), which generates assembly directly
 from the GHC-internal C-- data type.

-
 STG is a simple functional language, rather like the more famous
 Core language. It differs in the following main respects:

 1. In its current incarnation, it isn't typed in the Haskell sense,
-  though it does know about *representation* types
+   though it does know about *representation* types
 1. It is in administrative normal form (ANF), which is where every
-  subexpression is given a name
+   subexpression is given a name
 1. Every $\\lambda$, constructor application, and primitive operator
-  is $\\eta$-expanded
+   is $\\eta$-expanded
 1. It is annotated with a ton of information that the code
-  generator is interested in knowing
-
+   generator is interested in knowing

 STG expressions can be one of the following:

 1. Atoms (i.e. literals and variables)
 1. `let`-bindings (both recursive and non-recursive) over another
-  expression, where let-bound things are one of:
-
-  - A function value with explicit lambdas
-  - An unsaturated application
-  - A constructor applied to atoms
-  - A thunk (i.e. any expression not fitting into one of the above
-    categories)
-
+   expression, where let-bound things are one of:
+   - A function value with explicit lambdas
+   - An unsaturated application
+   - A constructor applied to atoms
+   - A thunk (i.e. any expression not fitting into one of the above
+     categories)
 1. Saturated primitive application of a primitive to variables
 1. Application of a variable to one or more atoms
 1. Case deconstruction of an expression, where each branch may also
-  be an expression
-
+   be an expression

 The job of the *STG machine* is to evaluate these expressions in
 a way which is efficiently implementable on standard hardware. This
@@ -76,7 +69,6 @@ document will look at how exactly this is achieved by looking at
 real examples of the C-- code GHC generates for various Haskell
 expressions.

-
 This document will take a very low-level view of the machine, so if
 you want to get comfortable with how the STG machine executes at a
 more abstract level before reading this document, you might want to
@@ -89,8 +81,8 @@ system has also been implemented as a Haskell program called
 [Bernie Pope](https://github.com/bjpop/ministg), for those who wish to
 see it in action on some simple examples.

-## An overview of the STG machine

+## An overview of the STG machine

 Before we dive in, a note: this document will describe the STG
 machine as it is implemented on x86-style architectures. I will use
@@ -99,7 +91,6 @@ x86 by GHC" interchangeably. The implementation is somewhat
 different on x64, not least due to the greater number of available
 registers.

-
 This overview section is rather bare. Readers might be able to fill
 in any gaps in my explanation by using some of the following
 sources:
@@ -107,58 +98,58 @@ sources:
 - [The Haskell Execution Model](https://gitlab.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution)
 - [Storage](https://gitlab.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage)
 - [The Spineless Tagless G-machine](http://research.microsoft.com/en-us/um/people/simonpj/Papers/spineless-tagless-gmachine.ps.gz)
-
  - now sadly rather out of date
 - [Faster laziness through dynamic pointer tagging](http://research.microsoft.com/en-us/um/people/simonpj/papers/ptr-tag/ptr-tagging.pdf)

-### Components of the machine

+### Components of the machine

 In its bare essentials, the STG machine consists of three parts:

 1. The STG registers:

-  - There are rather a lot of registers here: more than can be
-    practicably stored in actual available processor registers on most
-    architectures.
-  - To deal with the lack of processor registers, most of the STG 
-    registers are actually kept on the stack in a block of memory
-    pointed to by a special STG register called the "base register" (or 
-    `BaseReg`). To get or set values of registers which are not kept in
-    processor registers, the STG machine generates an instruction to
-    load or store from an address relative to the `BaseReg`.
-  - The most important four registers are the `BaseReg`, the stack
-    pointer (`Sp`), the heap pointer (`Hp`), and the general purpose
-    register `R1` which is used for intermediate values, as well as for 
-    returning evaluated values when unwinding the stack. These are the 
-    four registers which are assigned actual processor registers when
-    implementing the STG machine on x86.
+   - There are rather a lot of registers here: more than can be
+     practicably stored in actual available processor registers on most
+     architectures.
+   - To deal with the lack of processor registers, most of the STG 
+     registers are actually kept on the stack in a block of memory
+     pointed to by a special STG register called the "base register" (or 
+     `BaseReg`). To get or set values of registers which are not kept in
+     processor registers, the STG machine generates an instruction to
+     load or store from an address relative to the `BaseReg`.
+   - The most important four registers are the `BaseReg`, the stack
+     pointer (`Sp`), the heap pointer (`Hp`), and the general purpose
+     register `R1` which is used for intermediate values, as well as for 
+     returning evaluated values when unwinding the stack. These are the 
+     four registers which are assigned actual processor registers when
+     implementing the STG machine on x86.
+
 1. The STG stack:

-  - Stores function arguments and continuations (i.e. the stack
-    frames which are executed when a function returns)
-  - Grows downwards in memory
-  - The top of the stack is pointed to by the STG register `Sp`, and 
-    the maximum available stack pointer is stored in `SpLim`. There is
-    no frame pointer.
+   - Stores function arguments and continuations (i.e. the stack
+     frames which are executed when a function returns)
+   - Grows downwards in memory
+   - The top of the stack is pointed to by the STG register `Sp`, and 
+     the maximum available stack pointer is stored in `SpLim`. There is
+     no frame pointer.

 1. The heap:

-  - Used to store many different sorts of heap object: notably
-    functions, thunks and data constructors
-  - Grows upwards in memory, towards the stack
-  - All allocation occurs using a bump-allocator: the heap pointer is
-    simply incremented by the number of bytes desired (subject to to a
-    check that this does not exhaust available memory). The garbage
-    collector is responsible for moving objects out of the area of the 
-    heap managed by the bump allocator and into the care of its 
-    generational collector.
-  - The last address in the bump-allocated part of the heap that has 
-    been used is pointed to by the STG register `Hp`, with `HpLim`
-    holding the maximum address available for bump-allocation.
+   - Used to store many different sorts of heap object: notably
+     functions, thunks and data constructors
+   - Grows upwards in memory, towards the stack
+   - All allocation occurs using a bump-allocator: the heap pointer is
+     simply incremented by the number of bytes desired (subject to to a
+     check that this does not exhaust available memory). The garbage
+     collector is responsible for moving objects out of the area of the 
+     heap managed by the bump allocator and into the care of its 
+     generational collector.
+   - The last address in the bump-allocated part of the heap that has 
+     been used is pointed to by the STG register `Hp`, with `HpLim`
+     holding the maximum address available for bump-allocation.

-### Important concepts in the machine

+### Important concepts in the machine

 Some of the key concepts in the STG machine include *closures*,
 *info tables* and *entry code*. We tackle them in reverse
@@ -193,7 +184,6 @@ Essentially a heap-allocated pair of the free variables of some
 code, and a pointer to its info table (i.e. its info pointer).
 </td></tr></table>

-
 For an example of how these parts work together, consider the
 following code

@@ -201,10 +191,8 @@ following code
 my_fun x zs = map (\y -> y + x) zs
 ```

-
 The nested lambda will give rise to all of the above objects.

-
 The closure will store a pointer to `x`'s closure (as it is a free
 variable of the lambda), along with a pointer to an info table.
 That info table will contain information relevant to a function
@@ -215,50 +203,44 @@ implement the addition by combining the closure for the free
 variable `x` (taken from the closure) with the stack-passed `y`
 variable's closure.

-
 Upon entry to some code, pointers to closures are made available in
 `R1`. That is to say, before entry code is jumped to, `R1` is set
 up to point to the associated closure, so that the entry code can
 access free variables (if any).

-
 Closures for code which contain no free variables (such as the
 closure for `True` and `False`, and functions applied to no
 arguments such as `(:)` and `id`) are allocated statically by the
 compiler in the same manner as info tables are.

-### Overview of execution model of the machine

+### Overview of execution model of the machine

 This will be covered in more detail in the examples below, so I
 will use this section to make some general points.

-
 The goal of the STG machine is to reduce the current expression to
 a value. When it has done so, it:

 1. Stores a tagged pointer to evaluated closure in the STG register
  `R1`
 1. Jumps to the entry code of the info table pointed to by the
-  value at the top of the STG stack
-
-  - This may also be called the info table of the *continuation* of
-    the expression
-
+   value at the top of the STG stack
+   - This may also be called the info table of the *continuation* of
+     the expression

 The continuation code is responsible for popping its info pointer
 (and stack-allocated free variables, if any) from the stack before
 returning.

-
 Arguments are passed on the stack, and are popped by the callee.
 Upon a jump to the entry code for a function, there are always
 precisely as many arguments on the stack as the (statically known)
 arity of that function, and those arguments will be followed by the
 info pointer of a continuation.

-## Saturated application to known functions

+## Saturated application to known functions

 Handling application in the STG machine is a big topic, and so in
 this first section we only look at the case of *saturated*
@@ -266,8 +248,8 @@ applications to *known* functions - i.e. those functions that the
 compiler statically knows information such as the entry code
 pointer and arity for.

-### Example 1: function application with sufficient stack space

+### Example 1: function application with sufficient stack space

 Application of functions is the bread and butter of the STG
 machine. Correspondingly, this first Haskell program
@@ -279,7 +261,6 @@ known_fun x = x
 known_app::()->Intknown_app_= known_fun 10
 ```

-
 compiles to very simple C-- code

 ```
@@ -292,14 +273,12 @@ Main_knownzuapp_entry(){cl3:
 _Main_knownzuapp_entry:Lcl3:movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)jmp_Main_knownzufun_entry
 ```

-
 The STG machine passes arguments to functions on the STG stack, and
 a pointer to the stack top is stored in the STG register `Sp`.
 Furthermore, because GHC currently uses the eval/apply variant of
 the STG machine, exactly as many arguments as the function expects
 to receive are guaranteed to present on the stack.

-
 Therefore, upon entry to the `known_app` function, we are
 guaranteed that the STG stack has a pointer to a closure of type
 `()` on top of it. In order to call `known_fun`, we just modify the
@@ -307,8 +286,8 @@ top of the stack to replace that pointer with a pointer to the
 statically allocated closure for the literal `10`, and then
 tail-call into the entry code of `known_fun`.

-### Example 2: function application that needs to grow the stack

+### Example 2: function application that needs to grow the stack

 This Haskell code is apparently little more complicated than the
 previous example
@@ -320,7 +299,6 @@ known_fun_2 x _= x
 known_app_2::()->Intknown_app_2_= known_fun_2 1010
 ```

-
 however, it generates radically different C-- code:

 ```
@@ -337,15 +315,14 @@ Main_knownzuappzu2_entry(){clE:if(Sp -4< SpLim)goto clH;
 _Main_knownzuappzu2_entry:LclE:leal-4(%ebp),%eaxcmpl84(%ebx),%eaxjbLclHmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,-4(%ebp)addl$-4,%ebpjmp_Main_knownzufunzu2_entryLclH:movl$_Main_knownzuappzu2_closure,%esijmp*-4(%ebx)
 ```

-
 As before, upon entry the STG stack is guaranteed to have a single
 closure pointer at its top. However, in order to call into
 known_fun_2 we need at least two free stack slots at the top for
 arguments, which means that we have to grow the stack by one word
 before we can make the call.

-#### Checking for sufficient stack space

+#### Checking for sufficient stack space

 First, we check to see if growing the stack would overflow
 allocated stack space, by comparing the STG stack pointer register
@@ -355,7 +332,6 @@ allocated stack space, by comparing the STG stack pointer register
 if(Sp -4< SpLim)goto clH;
 ```

-
 (The stack grows downwards, hence the *subtraction* of 4 from the
 current `Sp`). If the stack check fails, we branch to `clH`:

@@ -365,7 +341,6 @@ clH:
    jump stg_gc_fun();
 ```

-
 This stores the closure of the current function in `R1`, and then
 jumps into the hand-written garbage collector code to force it to
 grow the stack. After the stack has been grown, the collector will
@@ -374,8 +349,8 @@ stored in the (statically-allocated) `Main_knownzuappzu2_closure`
 closure pointed to by `R1`, and the stack check will be run again -
 hopefully succeeding this time!

-#### Making the known call

+#### Making the known call

 Given that the stack check succeeds, it is easy to make the actual
 call we are after. We simply grow the stack by the required amount,
@@ -389,7 +364,6 @@ course):
    Sp = Sp -4;
 ```

-
 A simple tail call to the new function finishes us off:

 ```
@@ -398,7 +372,6 @@ A simple tail call to the new function finishes us off:

 ## Example 3: Unsaturated applications to known functions

-
 Despite describing an undersaturated call, this Haskell code

 ```
@@ -408,7 +381,6 @@ known_fun_2 x _= x
 known_undersaturated_app::()->Int->Intknown_undersaturated_app_= known_fun_2 10
 ```

-
 compiles to straightforward C-- as follows

 ```
@@ -421,15 +393,14 @@ Main_knownzuundersaturatedzuapp_entry(){cmd:
 _Main_knownzuundersaturatedzuapp_entry:Lcmd:movlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)jmp_Main_knownzufunzu2_entry
 ```

-
 The reason that there is no special magic to deal with
 undersaturated applications to known functions is simple: GHC
 simply gives `known_undersaturated_app` an arity of 2, so by the
 time we jump to the entry code the stack must already contain any
 arguments required by `known_fun_2`.

-## Example 4: Applications to unknown functions

+## Example 4: Applications to unknown functions

 We aren't going to tackle oversaturated calls to known functions
 until we've considered happens to calls to statically-unknown
@@ -440,7 +411,6 @@ following Haskell code
 unknown_app::(Int->Int)->Int->Intunknown_app f x = f x
 ```

-
 Which compiles to this C-- function

 ```
@@ -454,7 +424,6 @@ Main_unknownzuapp_entry(){cnO:
 _Main_unknownzuapp_entry:Lcn0:movl(%ebp),%esiaddl$4,%ebpjmp_stg_ap_p_fast
 ```

-
 Unlike the previous cases we have looked at, we are compiling an
 application where we don't statically know either the arity or the
 info pointer of the function being applied. To deal with such
@@ -462,43 +431,41 @@ cases, the STG machine uses several pre-compiled "generic apply"
 functions which inspect the info-table for the function in question
 and decide how the available arguments should be applied to it.

-### Dealing with generic application

+### Dealing with generic application

 There are three cases the generic apply functions have to deal
 with:

 1. The function's arity (recorder in the function closure's info
-  table) exactly matches the number of arguments available on the
-  stack
+   table) exactly matches the number of arguments available on the
+   stack

-  - This is the best case. In this case, the generic apply function
-    simply makes a tail call into the function's entry code
+   - This is the best case. In this case, the generic apply function
+     simply makes a tail call into the function's entry code

 1. The function's arity is greater than the number of arguments
-  available on the stack
+   available on the stack

-  - In this case, the generic apply code allocates a PAP (partial
-    application) closure which closes over both the new arguments and
-    the function pointer, and returns that value, in the normal STGish
-    way, to the continuation on the top of the stack
+   - In this case, the generic apply code allocates a PAP (partial
+     application) closure which closes over both the new arguments and
+     the function pointer, and returns that value, in the normal STGish
+     way, to the continuation on the top of the stack

 1. The function's arity is less than the number of arguments
-  available on the stack
-
-  - In this case, a number of arguments matching the arity are pushed
-    on top of the stack, followed by a continuation which uses another
-    of the generic apply functions to apply the remaining arguments.
-    The code for the original function is then entered
-  - Eventually the code for the continuation is entered and another
-    generic apply function will be tail-called to deal with the
-    result
+   available on the stack

+   - In this case, a number of arguments matching the arity are pushed
+     on top of the stack, followed by a continuation which uses another
+     of the generic apply functions to apply the remaining arguments.
+     The code for the original function is then entered
+   - Eventually the code for the continuation is entered and another
+     generic apply function will be tail-called to deal with the
+     result

 Potentially, one generic apply function is required for every
 "argument pattern". Some example argument patterns are:

-
 Because the number of patterns is large (actually unbounded,
 because functions might be of any arity), GHC only generates
 generic apply functions for enough patterns so that 99.9% of all
@@ -508,8 +475,8 @@ chaining together several smaller generic apply functions, in a
 similar manner as when dealing with oversaturated function
 applications.

-### Making the call to the generic application code

+### Making the call to the generic application code

 Let's remind ourselves of the original code:

@@ -519,15 +486,14 @@ Let's remind ourselves of the original code:
    jump stg_ap_p_fast();
 ```

-
 Knowing about generic apply functions, the call itself is easy to
 understand. We pop the top of the stack (the function argument)
 into `R1` and then jump into the generic application code for the
 case where the stack contains a single pointer argument, which
 deals with all the cases for `f` described above.

-## Example 5: oversaturated applications to known functions

+## Example 5: oversaturated applications to known functions

 This Haskell code

@@ -538,7 +504,6 @@ known_fun_2 x _= x
 known_oversat_app::()->Intknown_oversat_app_= known_fun_2 id id 10
 ```

-
 compiles to the following C-- function

 ```
@@ -557,12 +522,10 @@ Main_knownzuoversatzuapp_entry(){cmj:if(Sp -12< SpLim)goto cmm;
 _Main_knownzuoversatzuapp_entry:Lcmj:leal-12(%ebp),%eaxcmpl84(%ebx),%eaxjbLcmmmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxaddl$209,%eaxmovl%eax,(%ebp)movlL_stg_ap_p_info$non_lazy_ptr,%eaxmovl%eax,-4(%ebp)movl$_base_GHCziBase_id_closure,-8(%ebp)movl$_base_GHCziBase_id_closure,-12(%ebp)addl$-12,%ebpjmp_Main_knownzufunzu2_entryLcmm:movl$_Main_knownzuoversatzuapp_closure,%esijmp*-4(%ebx)
 ```

-
 As you might see, despite being a call to a known function, this
 code makes use of the generic apply functions we discussed in the
 last section. Let's pick the function apart and see how it works.

-
 First, we do the usual stack check. What differs from the last time
 we saw this check is that we are not only allocating space for
 arguments on the stack, but also for a *continuation*. We set up
@@ -576,7 +539,6 @@ these new stack entries as follows:
    Sp = Sp -12;
 ```

-
 i.e. the final stack looks as follows (note that the code
 overwrites the old pointer to a closure of type ()):

@@ -595,14 +557,12 @@ overwrites the old pointer to a closure of type ()):

 ```

-
 Because `known_fun_2` is of arity 2, when we jump to its entry
 code, it will only consume the top two arguments from the stack:
 i.e. the two pointers to `base_GHCziBase_id_closure`. It will then
 evaluate to some sort of value and transfer control to the entry
 code for `stg_ap_p_info`.

-
 This is where the magic happens: the entry code for `stg_ap_p_info`
 will apply the function value that was returned from `known_fun_2`
 to the (pointer) argument in the "free variable" of its (stack
@@ -612,12 +572,11 @@ allocated) closure -- and we have arranged that that is
 for calls to unknown functions, so this will make use of the
 `stg_ap_p_fast` function we saw before.

-
 Finally, control will be transferred back to the caller for
 `known_oversat_app`, and all will be well.

-## Example 6: allocation of thunks and data

+## Example 6: allocation of thunks and data

 Something that happens all the time in Haskell is allocation. There
 are three principal types of thing that get allocated: function
@@ -640,7 +599,6 @@ common characteristics:
 - Info-tables containing various miscellaneous metadata about the
  heap object, such as function arity

-
 Let us look at how a thunk and a data constructor get allocated in
 a simple setting:

@@ -648,7 +606,6 @@ a simple setting:
 build_data::Int->MaybeIntbuild_data x =Just(x +1)
 ```

-
 This compiles into the following C--:

 ```
@@ -670,18 +627,16 @@ Main_buildzudata_entry(){clE:
 _Main_buildzudata_entry:LclE:addl$20,%edicmpl92(%ebx),%edijaLclHmovl$_slk_info,-16(%edi)movl(%ebp),%eaxmovl%eax,-8(%edi)movl$_base_DataziMaybe_Just_con_info,-4(%edi)leal-16(%edi),%eaxmovl%eax,(%edi)leal-2(%edi),%esiaddl$4,%ebpmovl(%ebp),%eaxjmp*(%eax)LclH:movl$20,112(%ebx)LclI:movl$_Main_buildzudata_closure,%esijmp*-4(%ebx)
 ```

-
 Let's break this function down slowly.

-### Checking for sufficient heap space

+### Checking for sufficient heap space

 Any function that needs to allocate memory might find that the heap
 has been exhausted. If that happens, it needs to call into the
 garbage collector in order to get the heap cleaned up and
 (possibly) enlarged.

-
 Hence, the first thing any such function does is check to see if
 enough memory is available for its purposes:

@@ -693,14 +648,12 @@ clE:
    HpAlloc =20;goto clI;
 ```

-
 This is simple enough. The function needs to allocate 20 bytes (the
 data constructor takes up 2 words, and the thunk will take up 3),
 so it speculatively increments Hp and then checks the STG registers
 `Hp` and `HpLim` (the pointer to the top of the available heap
 space) against each other.

-
 If memory is insufficient (i.e. we have moved `Hp` past the top of
 the available heap), the code deals with it by setting the
 `HpAlloc` register to the number of bytes needed and `R1` to the
@@ -710,7 +663,6 @@ cleanup. The garbage collector will resume execution of the code by
 using the information from `R1`, after it has freed up enough
 memory.

-
 Side note: I believe that the line setting `R1` is unnecessary
 here, because `R1` should anyway always be set to the address of
 the closure when executing the closure entry code. I could be
@@ -718,7 +670,6 @@ wrong, though.

 ### Performing the actual allocation

-
 Once the heap check succeeds, we will be able to enter the body of
 the function proper. Since the `Hp` has already been incremented,
 we can just construct the new heap objects directly:
@@ -730,7 +681,6 @@ we can just construct the new heap objects directly:
    I32[Hp +0]= Hp -16;
 ```

-
 So we get something like this:

 ```wiki
@@ -750,12 +700,11 @@ So we get something like this:

 ```

-
 The bottom two words are the allocated `Just` value, and the three
 above that correspond to the `x + 1` closure.

-### Returning an allocated value to the caller

+### Returning an allocated value to the caller

 Now that we have allocated the data we entered the function in
 order to construct, we need to return it to the caller. This is
@@ -767,49 +716,41 @@ achieved by the following code:
    jump (I32[I32[Sp +0]])();
 ```

-
 To return, the STG machine:

 1. Sets `R1` to the pointer to the result of evaluation
 1. Pops all the arguments to the function from the stack
 1. Jumps to the entry code for the continuation. This is always
-  found at the top of the STG stack, logically below any arguments
-  that were pushed to make the call.
-
+   found at the top of the STG stack, logically below any arguments
+   that were pushed to make the call.

 This is indeed exactly what happens here, with two interesting
 points: pointer tagging, and the double-deference of the stack
 pointer. These will be discussed in the next two subsections.

-#### Pointer tagging

+#### Pointer tagging

 One exciting feature is that the code setting `R1`, i.e.
 `R1 = Hp - 2`. This is setting `R1` to point to the `Just`, we just
-allocated, but simultaneously tagging that pointer with the value
-
-1. The fact that the tag is non-zero indicates to users of the
-
-
+allocated, but simultaneously tagging that pointer with the
+value 1. The fact that the tag is non-zero indicates to users of the
 pointer that the thing pointed to is already evaluated.
 Furthermore, because `Maybe` has only two constructors, we are able
 to use the pointer tags to record which constructor it evaluated
 to: in this case, the 2 indicates the `Just` constructor.

-
 It is compulsory to tag pointers before jumping to the address of
 the continuation entry code: the entry code can and will rely on
 those tags being present!

 #### `TABLES_NEXT_TO_CODE`

-
 Because I have compiled GHC without `TABLES_NEXT_TO_CODE`, the
 entry code for the continuation is found by dereferencing the
 pointer to the info table we found at the top of the STG stack -
 i.e. a double-dereference.

-
 The layout of heap objects without `TABLES_NEXT_TO_CODE` is as
 follows:

@@ -823,7 +764,6 @@ follows:

 ```

-
 With `TABLES_NEXT_TO_CODE` on, the situation looks more like this:

 ```wiki
@@ -836,7 +776,6 @@ With `TABLES_NEXT_TO_CODE` on, the situation looks more like this:

 ```

-
 The `TABLES_NEXT_TO_CODE` optimisation removes the need for that
 second dereference during the return, because the entry code is
 always right next to the info table. However, it requires special
@@ -844,8 +783,8 @@ support from the backend for ensuring that data (i.e. the info
 table) and code are contiguous in memory, so it cannot always be
 used.

-## Example 7: `case` expressions

+## Example 7: `case` expressions

 Let us now examine how `case` expressions are handled. Compiling
 the following Haskell
@@ -855,7 +794,6 @@ case_scrut :: Maybe Int -> Int
 case_scrut x = case x of Just x -> x; Nothing -> 10
 ```

-
 Produces this C-- code

 ```
@@ -880,15 +818,14 @@ scj_ret(){cct:
 _Main_casezuscrut_entry:Lccx:movl(%ebp),%esimovl$_scj_info,(%ebp)testl$3,%esijneLccAmovl(%esi),%eaxjmp*(%eax)LccA:jmp*_scj_info_scj_ret:Lcct:movl%esi,%eaxandl$3,%eaxcmpl$2,%eaxjaeLccvmovlL_stg_INTLIKE_closure$non_lazy_ptr,%eaxleal209(%eax),%esiaddl$4,%ebpmovl(%ebp),%eaxjmp*(%eax)Lccv:movl2(%esi),%esiaddl$4,%ebpandl$-4,%esimovl(%esi),%eaxjmp*(%eax)
 ```

-
 Notice that GHC has generated *two* functions:
 `Main_casezuscrut_entry` and `scj_ret` correspond to the code for
 forcing the argument to the `case`, and for the *continuation* of
 the `case` respectively. Let's pick them apart and see how they
 work!

-### Forcing the scrutinee of the `case`

+### Forcing the scrutinee of the `case`

 When we first call the `case_scrut` function, its entry code begins
 executing:
@@ -901,7 +838,6 @@ ccx:
    jump (I32[scj_info])();
 ```

-
 This is a function of arity 1 (i.e. with a single argument), so
 upon entry the machine state looks like this:

@@ -918,13 +854,11 @@ upon entry the machine state looks like this:

 ```

-
 Because this is a top level function, the closure is statically
 allocated and contains no free variables. However, as discussed
 previously, the single argument to the function is guaranteed to be
 present at the top of the stack.

-
 The code starts off by saving this argument (the `x`) temporarily
 into `R1`:

@@ -933,7 +867,6 @@ ccx:
    R1 = I32[Sp +0];
 ```

-
 The next thing the code does is overwrites this argument on the
 stack with a pointer to the info-table of the continuation code.
 This is the code that will be invoked after `x` has been evaluated
@@ -944,7 +877,6 @@ as the `Nothing` or as the `Just` branch of the case:
    I32[Sp +0]= scj_info;
 ```

-
 As we saw earlier, any time that the STG machine decides that it
 has a value in its hand, it will continue evaluation by
 tail-calling the entry code found by dereferencing the info-table
@@ -952,13 +884,11 @@ pointer at the top of the stack. So by putting the address of our
 continuation in here, we ensure that the entry code for `scj_info`
 is executed after `x` becomes a value.

-
 Now, what we need to do is to start the evaluation of `x`. We could
 just jump into `x`'s entry code and hope for the best, but thanks
 to GHC's pointer tagging we can sometimes avoid doing this indirect
 branch.

-
 So, instead, we test to see if the `x` pointer has a tag. If it is
 tagged, then we know that it is already evaluated and hence jump
 directly to the code for the continuation. If it is not tagged, we
@@ -971,14 +901,12 @@ if(R1 &3!=0)goto ccA;
    jump (I32[scj_info])();
 ```

-
 Note the test `R1 & 3 != 0`: this reflects the fact that pointer
 tags are stored in the lower 2 bits of the pointer on 32 bit
 machines. Another interesting feature is how the `jump`
 instructions find the entry code: again, we see a deference of the
 info pointer because `TABLES_NEXT_TO_CODE` is turned off.

-
 As we saw, the `case` scrutinisation code ended with one of two
 things happening: 1. A direct call into the continuation code
 `scj_ret` if the scrutinee was already evaluated 2. A call into the
@@ -988,12 +916,11 @@ with that information) - Because we pushed `scj_info` onto the STG
 stack, control will eventually return to `scj_ret` after the
 evaluation of `x` has finished

-
 It is now time to examine the continuation code to see what happens
 after `x` becomes a value.

-### Dealing with the forced scrutinee

+### Dealing with the forced scrutinee

 The continuation code is a little more complicated:

@@ -1009,7 +936,6 @@ cct:
    jump (I32[I32[R1]])();
 ```

-
 Whenever the STG machine evaluates to a value it will return the
 value by jumping to the entry point at the top of the stack. In
 this case, `R1` is guaranteed to be a (tagged) pointer to the thing
@@ -1030,7 +956,6 @@ constructor it was evaluated to. In this case, we would have to
 read the constructor tag by dereferencing `R1` and testing the
 resulting info table pointer against all possibilities.

-
 If the tag was greater than or equal to 2, we go to the `ccv`
 branch, which deals with what happens if we had a `Just`. In this
 case, we need to continue by forcing the thunk inside the `Just`
@@ -1045,7 +970,6 @@ ccv:
    jump (I32[I32[R1]])();
 ```

-
 To access the thing inside the `Just`, the code assumes that the
 `R1` pointer is tagged with the 2 that indicates a `Just`
 constructor, and hence finds the first free variable (stored 4
@@ -1056,14 +980,13 @@ the stack in `Main_casezuscrut_entry` by moving `Sp` up 4 bytes
 jumps into the entry code for the `R1` thunk, using the same
 double-dereference pattern discussed earlier.

-
 There seems to be a small missed opportunity here: the code could
 check the pointer tag on `R1`, and then return directly if it is
 set. I imagine that this isn't being done in order to reduce
 possible code bloat.

-## Example 8: thunks and thunk update

+## Example 8: thunks and thunk update

 You might be wondering how the `x + 1` thunk we saw allocated in a
 previous section will behave when it is actually forced. To remind
@@ -1074,7 +997,6 @@ code:
 build_data::Int->MaybeIntbuild_data x =Just(x +1)
 ```

-
 So how does the `x + 1` thunk work? An excellent question! Let's
 take a look at the C-- for its entry code and find out:

@@ -1102,7 +1024,6 @@ soN_ret(){cp7:
 _sst_entry:Lcph:leal-12(%ebp),%eaxcmpl84(%ebx),%eaxjbLcpjmovlL_stg_upd_frame_info$non_lazy_ptr,%eaxmovl%eax,-8(%ebp)movl%esi,-4(%ebp)movl8(%esi),%esimovl$_soN_info,-12(%ebp)addl$-12,%ebptestl$3,%esijneLcpkmovl(%esi),%eaxjmp*(%eax)Lcpj:jmp*-8(%ebx)Lcpk:jmp*_soN_info_soN_ret:Lcp7:addl$8,%edicmpl92(%ebx),%edijaLcpcmovl3(%esi),%eaxincl%eaxmovl$_ghczmprim_GHCziTypes_Izh_con_info,-4(%edi)movl%eax,(%edi)leal-3(%edi),%esiaddl$4,%ebpmovlL_stg_upd_frame_info$non_lazy_ptr,%eaxjmp*(%eax)Lcpc:movl$8,112(%ebx)Lcpd:jmp*-8(%ebx)
 ```

-
 The original Haskell code read `x + 1`, but GHC has inlined the
 actual code for the addition operation on `Int`s, which looks
 something like:
@@ -1111,15 +1032,14 @@ something like:
 plusInt(I# a)(I# b)=I#(a + b)
 ```

-
 The second pattern match (to get `b`) has been performed statically
 by GHC, obtaining the machine literal 1, which shows up directly in
 the generated code. Therefore, the code only need to evaluate and
 case-decompose the unknown free variable `x` of our closure, to get
 the `a` argument to `plusInt`.

-### Thunk entry point

+### Thunk entry point

 This evaluation is what is being done by the thunk entry code
 `slk_entry`. Ignoring the stack check, the C-- begins thusly:
@@ -1132,7 +1052,6 @@ This evaluation is what is being done by the thunk entry code
    Sp = Sp -12;
 ```

-
 Remembering that upon entry to the thunk entry code, `R1` points to
 the thunk's closure, the new stack looks as follows:

@@ -1151,12 +1070,10 @@ the thunk's closure, the new stack looks as follows:
                                       +-------------+
 ```

-
 The C-- statement `R1 = I32[R1 + 8]` is pulling out the pointer to
 the free variable of the thunk (which was set up in
 `Main_buildzudata_entry`) into `R1`.

-
 Finally, the entry code evaluates that free variable (checking the
 tag bits of the pointer first, as usual):

@@ -1165,18 +1082,15 @@ if(R1 &3!=0)goto cpk;
    jump (I32[I32[R1]])();cpk: jump (I32[soN_info])();
 ```

-
 Because we put `soN_info` at the top of the stack, when evaluation
 of `x` is complete the STG machine will continue by executing the
 `soN_ret` code.

-
 The most interesting feature of this code is the extra stuff that
 has been pushed onto the stack below `soN_ret`: an info pointer
 called `stg_upd_frame_info`, and a pointer to the thunk currently
 being evaluated.

-
 This is all part of the STG machine's thunk update mechanism. When
 the `soN_ret` continuation returns, it will transfer control
 *not* to the code forcing the thunk, but to some code which
@@ -1186,18 +1100,16 @@ indirection closure is trivial: it immediately returns a pointer to
 the thing that was returned from the `soN_ret` continuation in
 `R1`.

-
 These indirections are the mechanism which ensures that the STG
 machine never repeats the work of evaluating a thunk more than
 once: after the first evaluation, any code forcing the thunk jumps
 into the indirection entry code rather than `slk_entry`.

-
 That being said, let us look at how the continuation responsible
 for actually finding the value of `x + 1` works:

-### Continuation of the thunk

+### Continuation of the thunk

 Upon entry to the continuation code, we have the evaluated `x` in
 `R1`: it now needs to do the addition and allocate a `I#`
@@ -1214,7 +1126,6 @@ check, we have the following code:
    jump (I32[stg_upd_frame_info])();
 ```

-
 This is mostly standard stuff. Because the `R1` pointer is
 guaranteed tagged, and there is only one possible constructor, the
 tag must be 1 and so the `Int#` value inside the `Int` is pulled
@@ -1222,15 +1133,14 @@ out using `I32[R1 + 3]`. This is then put into a newly
 heap-allocated `I#` constructor, which is returned in `R1` after we
 pop the `soN_info` pointer from the stack.

-
 The only interesting point is where we return to: rather than
 dereference `Sp` to find the info pointer at the top of the STG
 stack, GHC has generated code that takes advantage of the fact that
 the `Sp` is guaranteed to point to `stg_upd_frame_info`. This
 avoids one pointer dereference.

-## Conclusion

+## Conclusion

 This document has left much of the detail of how STG is implemented
 out: notable omissions include CAFs, and the precise behaviour of