|
|
# The GHC Commentary: Data types and data constructors
|
|
|
|
|
|
|
|
|
This chapter was thoroughly changed Feb 2003.
|
|
|
|
|
|
## Data types
|
|
|
|
|
|
|
|
|
Consider the following data type declaration:
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT !(a,a) !(T a) | Nil
|
|
|
|
|
|
f x = case x of
|
|
|
MkT p q -> MkT p (q+1)
|
|
|
Nil -> Nil
|
|
|
```
|
|
|
|
|
|
|
|
|
The user's source program mentions only the constructors `MkT` and `Nil`. However, these constructors actually *do* something in addition to building a data value. For a start, `MkT` evaluates its arguments. Secondly, with the flag `-funbox-strict-fields` GHC will flatten (or unbox) the strict fields. So we may imagine that there's the *source* constructor `MkT` and the *representation* constructor `MkT`, and things start to get pretty confusing.
|
|
|
|
|
|
|
|
|
GHC now generates three unique `Name`s for each data constructor:
|
|
|
|
|
|
```wiki
|
|
|
---- OccName ------
|
|
|
String Name space Used for
|
|
|
---------------------------------------------------------------------------
|
|
|
The "source data con" MkT DataName The DataCon itself
|
|
|
The "worker data con" MkT VarName Its worker Id
|
|
|
aka "representation data con"
|
|
|
The "wrapper data con" $WMkT VarName Its wrapper Id (optional)
|
|
|
```
|
|
|
|
|
|
|
|
|
Recall that each occurrence name (OccName) is a pair of a string and a name space (see [ The truth about names](http://darcs.haskell.org/ghc/docs/comm/the-beast/names.html) (TODO Point at new commentary equivalent)), and two OccNames are considered the same only if both components match. That is what distinguishes the name of the name of the DataCon from the name of its worker Id. To keep things unambiguous, in what follows we'll write "MkT{d}" for the source data con, and "MkT{v}" for the worker Id. (Indeed, when you dump stuff with "-ddumpXXX", if you also add "-dppr-debug" you'll get stuff like "Foo {- d rMv -}". The "d" part is the name space; the "rMv" is the unique key.)
|
|
|
|
|
|
|
|
|
Each of these three names gets a distinct unique key in GHC's name cache.
|
|
|
|
|
|
# The life cycle of a data type
|
|
|
|
|
|
|
|
|
Suppose the Haskell source looks like this:
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT !(a,a) !Int | Nil
|
|
|
|
|
|
f x = case x of
|
|
|
Nil -> Nil
|
|
|
MkT p q -> MkT p (q+1)
|
|
|
```
|
|
|
|
|
|
|
|
|
When the parser reads it in, it decides which name space each lexeme comes from, thus:
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT{d} !(a,a) !Int | Nil{d}
|
|
|
|
|
|
f x = case x of
|
|
|
Nil{d} -> Nil{d}
|
|
|
MkT{d} p q -> MkT{d} p (q+1)
|
|
|
```
|
|
|
|
|
|
|
|
|
Notice that in the Haskell source *all data contructors are named via the "source data con" MkT{d}*, whether in pattern matching or in expressions.
|
|
|
|
|
|
|
|
|
In the translated source produced by the type checker (-ddump-tc), the program looks like this:
|
|
|
|
|
|
```wiki
|
|
|
f x = case x of
|
|
|
Nil{d} -> Nil{v}
|
|
|
MkT{d} p q -> $WMkT p (q+1)
|
|
|
```
|
|
|
|
|
|
|
|
|
Notice that the type checker replaces the occurrence of MkT by the *wrapper*, but the occurrence of Nil by the *worker*. Reason: Nil doesn't have a wrapper because there is nothing to do in the wrapper (this is the vastly common case).
|
|
|
|
|
|
|
|
|
Though they are not printed out by "-ddump-tc", behind the scenes, there are also the following: the data type declaration and the wrapper function for MkT.
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT{d} a a Int# | Nil{d}
|
|
|
|
|
|
$WMkT :: (a,a) -> T a -> T a
|
|
|
$WMkT p t = case p of
|
|
|
(a,b) -> seq t (MkT{v} a b t)
|
|
|
```
|
|
|
|
|
|
|
|
|
Here, the *wrapper* $WMkT evaluates and takes apart the argument p, evaluates the argument t, and builds a three-field data value with the *worker* constructor MkT{v}. (There are more notes below about the unboxing of strict fields.) The worker $WMkT is called an *implicit binding*, because it's introduced implicitly by the data type declaration (record selectors are also implicit bindings, for example). Implicit bindings are injected into the code just before emitting code or External Core.
|
|
|
|
|
|
|
|
|
After desugaring into Core (-ddump-ds), the definition of f looks like this:
|
|
|
|
|
|
```wiki
|
|
|
f x = case x of
|
|
|
Nil{d} -> Nil{v}
|
|
|
MkT{d} a b r -> let { p = (a,b); q = I# r } in
|
|
|
$WMkT p (q+1)
|
|
|
```
|
|
|
|
|
|
|
|
|
Notice the way that pattern matching has been desugared to take account of the fact that the "real" data constructor MkT has three fields.
|
|
|
|
|
|
|
|
|
By the time the simplifier has had a go at it, f will be transformed to:
|
|
|
|
|
|
```wiki
|
|
|
f x = case x of
|
|
|
Nil{d} -> Nil{v}
|
|
|
MkT{d} a b r -> MkT{v} a b (r +# 1#)
|
|
|
```
|
|
|
|
|
|
|
|
|
Which is highly cool.
|
|
|
|
|
|
## The constructor wrapper functions
|
|
|
|
|
|
|
|
|
The wrapper functions are automatically generated by GHC, and are really emitted into the result code (albeit only after CorePre; see `CorePrep.mkImplicitBinds`). The wrapper functions are inlined very vigorously, so you will not see many occurrences of the wrapper functions in an optimised program, but you may see some. For example, if your Haskell source has
|
|
|
|
|
|
```wiki
|
|
|
map MkT xs
|
|
|
```
|
|
|
|
|
|
|
|
|
then `$WMkT` will not be inlined (because it is not applied to anything). That is why we generate real top-level bindings for the wrapper functions, and generate code for them.
|
|
|
|
|
|
## The constructor worker functions
|
|
|
|
|
|
|
|
|
Saturated applications of the constructor worker function MkT{v} are treated specially by the code generator; they really do allocation. However, we do want a single, shared, top-level definition for top-level nullary constructors (like True and False). Furthermore, what if the code generator encounters a non-saturated application of a worker? E.g. (`map Just xs`). We could declare that to be an error (CorePrep should saturate them). But instead we currently generate a top-level defintion for each constructor worker, whether nullary or not. It takes the form:
|
|
|
|
|
|
```wiki
|
|
|
MkT{v} = \ p q r -> MkT{v} p q r
|
|
|
```
|
|
|
|
|
|
|
|
|
This is a real hack. The occurrence on the RHS is saturated, so the code generator (both the one that generates abstract C and the byte-code generator) treats it as a special case and allocates a MkT; it does not make a recursive call! So now there's a top-level curried version of the worker which is available to anyone who wants it.
|
|
|
|
|
|
|
|
|
This strange definition is not emitted into External Core. Indeed, you might argue that we should instead pass the list of `TyCon`s to the code generator and have it generate magic bindings directly. As it stands, it's a real hack: see the code in CorePrep.mkImplicitBinds.
|
|
|
|
|
|
## External Core
|
|
|
|
|
|
|
|
|
When emitting External Core, we should see this for our running example:
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT a a Int# | Nil{d}
|
|
|
|
|
|
$WMkT :: (a,a) -> T a -> T a
|
|
|
$WMkT p t = case p of
|
|
|
(a,b) -> seq t (MkT a b t)
|
|
|
|
|
|
f x = case x of
|
|
|
Nil -> Nil
|
|
|
MkT a b r -> MkT a b (r +# 1#)
|
|
|
```
|
|
|
|
|
|
|
|
|
Notice that it makes perfect sense as a program all by itself. Constructors look like constructors (albeit not identical to the original Haskell ones).
|
|
|
|
|
|
|
|
|
When reading in External Core, the parser is careful to read it back in just as it was before it was spat out, namely:
|
|
|
|
|
|
```wiki
|
|
|
data T a = MkT{d} a a Int# | Nil{d}
|
|
|
|
|
|
$WMkT :: (a,a) -> T a -> T a
|
|
|
$WMkT p t = case p of
|
|
|
(a,b) -> seq t (MkT{v} a b t)
|
|
|
|
|
|
f x = case x of
|
|
|
Nil{d} -> Nil{v}
|
|
|
MkT{d} a b r -> MkT{v} a b (r +# 1#)
|
|
|
```
|
|
|
|
|
|
## Unboxing strict fields
|
|
|
|
|
|
|
|
|
If GHC unboxes strict fields (as in the first argument of MkT above), it also transforms source-language case expressions. Suppose you write this in your Haskell source:
|
|
|
|
|
|
```wiki
|
|
|
case e of
|
|
|
MkT p t -> ..p..t..
|
|
|
```
|
|
|
|
|
|
|
|
|
GHC will desugar this to the following Core code:
|
|
|
|
|
|
```wiki
|
|
|
case e of
|
|
|
MkT a b t -> let p = (a,b) in ..p..t..
|
|
|
```
|
|
|
|
|
|
|
|
|
The local let-binding reboxes the pair because it may be mentioned in the case alternative. This may well be a bad idea, which is why `-funbox-strict-fields` is an experimental feature.
|
|
|
|
|
|
|
|
|
It's essential that when importing a type `T` defined in some external module `M`, GHC knows what representation was used for that type, and that in turn depends on whether module M was compiled with `-funbox-strict-fields`. So when writing an interface file, GHC therefore records with each data type whether its strict fields (if any) should be unboxed.
|
|
|
|
|
|
## Labels and info tables
|
|
|
|
|
|
*Quick rough notes: SLPJ March 2003.*
|
|
|
|
|
|
|
|
|
Every data constructor `C` has two info tables:
|
|
|
|
|
|
- The static info table (label `C_static_info`), used for statically-allocated constructors.
|
|
|
- The dynamic info table (label `C_con_info`), used for dynamically-allocated constructors.
|
|
|
|
|
|
|
|
|
Statically-allocated constructors are not moved by the garbage collector, and therefore have a different closure type from dynamically-allocated constructors; hence they need a distinct info table. Both info tables share the same entry code, but since the entry code is physically juxtaposed with the info table, it must be duplicated (`C_static_entry` and `C_con_entry` respectively). |