Commit 77b2c81d authored by simonpj's avatar simonpj
Browse files

[project @ 2003-02-11 17:19:35 by simonpj]

Lots of new stuff about data types
parent e0741782
......@@ -65,17 +65,13 @@ identifiers, expressions, rules, and their operations.</strong>
Type (loop DataCon.DataCon, loop Subst.substTy)
FieldLabel( Type) <br>
FieldLabel(Type) <br>
TysPrim(Type) <br>
PprEnv (loop DataCon.DataCon, Type)
Unify <br>
PprType (PprEnv)
Literal (TysPrim, PprType) <br>
DataCon (loop PprType)
DataCon (loop PprType, loop Subst.substTyWith, FieldLabel.FieldLabel)
TysWiredIn (DataCon.mkDataCon, loop MkId.mkDataConId, loop Generics.mkGenInfo)
TysWiredIn (loop MkId.mkDataConWorkId, loop Generics.mkGenInfo, DataCon.mkDataCon)
TcType( lots of TysWiredIn stuff)
......@@ -60,7 +60,7 @@
<li><a href="the-beast/basicTypes.html">The Basics</a>
<li><a href="the-beast/modules.html">Modules, ModuleNames and
<li><a href="the-beast/names.html">The truth about names: Names and OccNamesd</a>
<li><a href="the-beast/names.html">The truth about names: Names and OccNames</a>
<li><a href="the-beast/vars.html">The Real Story about Variables, Ids,
TyVars, and the like</a>
<li><a href="the-beast/data-types.html">Data types and constructors</a>
......@@ -9,6 +9,7 @@
<h1>The GHC Commentary - Data types and data constructors</h1>
This chapter was thoroughly changed Feb 2003.
<h2>Data types</h2>
......@@ -16,44 +17,111 @@ Consider the following data type declaration:
data T a = MkT !(a,a) !(T a) | Nil
f x = case x of
MkT p q -> MkT p (q+1)
Nil -> Nil
The user's source program mentions only the constructors <tt>MkT</tt>
and <tt>Nil</tt>. However, these constructors actually <em>do</em> something
in addition to building a data value. For a start, <tt>MkT</tt> evaluates
its arguments. Secondly, with the flag <tt>-funbox-strict-fields</tt> GHC
will flatten (or unbox) the strict fields. So GHC generates a top-level function
for each data constructor, as follows:
will flatten (or unbox) the strict fields. So we may imagine that there's the
<em>source</em> constructor <tt>MkT</tt> and the <em>representation</em> constructor
<tt>MkT</tt>, and things start to get pretty confusing.
GHC now generates three unique <tt>Name</tt>s for each data constructor:
---- OccName ------
String Name space Used for
The "source data con" MkT DataName The DataCon itself
The "worker data con" MkT VarName Its worker Id
aka "representation data con"
The "wrapper data con" $WMkT VarName Its wrapper Id (optional)
Recall that each occurrence name (OccName) is a pair of a string and a
name space (see <a href="names.html">The truth about names</a>), and
two OccNames are considered the same only if both components match.
That is what distinguishes the name of the name of the DataCon from
the name of its worker Id. To keep things unambiguous, in what
follows we'll write "MkT{d}" for the source data con, and "MkT{v}" for
the worker Id. (Indeed, when you dump stuff with "-ddumpXXX", if you
also add "-dppr-debug" you'll get stuff like "Foo {- d rMv -}". The
"d" part is the name space; the "rMv" is the unique key.)
Each of these three names gets a distinct unique key in GHC's name cache.
<h2>The life cycle of a data type</h2>
Suppose the Haskell source looks like this:
MkT :: (a,a) -> T a -> T a
MkT p t = case p of
(a,b) -> seq t ($wMkT a b t)
data T a = MkT !(a,a) !Int | Nil{d}
Nil :: T a
Nil = $wNil
f x = case x of
Nil -> Nil
MkT p q -> MkT p (q+1)
When the parser reads it in, it decides which name space each lexeme comes
from, thus:
data T a = MkT{d} !(a,a) !Int | Nil{d}
Here, the <em>wrapper</em> <tt>MkT</tt> evaluates and takes the argument <tt>p</tt>,
f x = case x of
Nil{d} -> Nil{d}
MkT{d} p q -> MkT{d} p (q+1)
Notice that in the Haskell source <em>all data contructors are named via the "source data con" MkT{d}</em>,
whether in pattern matching or in expressions.
In the translated source produced by the type checker (-ddump-tc), the program looks like this:
f x = case x of
Nil{d} -> Nil{v}
MkT{d} p q -> $WMkT p (q+1)
Notice that the type checker replaces the occurrence of MkT by the <em>wrapper</em>, but
the occurrence of Nil by the <em>worker</em>. Reason: Nil doesn't have a wrapper because there is
nothing to do in the wrapper (this is the vastly common case).
Though they are not printed out by "-ddump-tc", behind the scenes, there are
also the following: the data type declaration and the wrapper function for MkT.
data T a = MkT{d} a a Int# | Nil{d}
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of
(a,b) -> seq t (MkT{v} a b t)
Here, the <em>wrapper</em> <tt>$WMkT</tt> evaluates and takes apart the argument <tt>p</tt>,
evaluates the argument <tt>t</tt>, and builds a three-field data value
with the <em>worker</em> constructor <tt>$wMKT</tt>. (There are more notes below
about the unboxing of strict fields.)
with the <em>worker</em> constructor <tt>MkT{v}</tt>. (There are more notes below
about the unboxing of strict fields.) The worker $WMkT is called an <em>implicit binding</em>,
because it's introduced implicitly by the data type declaration (record selectors
are also implicit bindings, for example). Implicit bindings are injected into the code
just before emitting code or External Core.
So the original constructors, <tt>MkT</tt> and <tt>Nil</tt> are really just
<em>wrappers</em> which perhaps do some work before calling the <em>workers</em>
<tt>$wMkT</tt> and <tt>$wNil</tt>. The workers are
the "representation constructors" of
the "representation data type", which we can think of as being defined thus:
After desugaring into Core (-ddump-ds), the definition of <tt>f</tt> looks like this:
f x = case x of
Nil{d} -> Nil{v}
MkT{d} a b r -> let { p = (a,b); q = I#r } in
$WMkT p (q+1)
Notice the way that pattern matching has been desugared to take account of the fact
that the "real" data constructor MkT has three fields.
By the time the simplifier has had a go at it, <tt>f</tt> will be transformed to:
data T a = $wMkT a a Int | $wNil
f x = case x of
Nil{d} -> Nil{v}
MkT{d} a b r -> MkT{v} a b (r +# 1#)
Which is highly cool.
This representation data type, gives the number and types of
fields of the constructors used to represent values of type <tt>T</tt>.
This representation type is also what is emitted when you print External Core
from GHC.
<h3> The constructor wrapper functions </h3>
<h2> The constructor wrapper functions </h2>
The wrapper functions are automatically generated by GHC, and are
really emitted into the result code (albeit only after CorePre; see
......@@ -65,12 +133,70 @@ if your Haskell source has
map MkT xs
then <tt>MkT</tt> will not be inlined (because it is not applied to anything).
then <tt>$WMkT</tt> will not be inlined (because it is not applied to anything).
That is why we generate real top-level bindings for the wrapper functions,
and generate code for them.
<h3> Unboxing strict fields </h3>
<h2> The constructor worker functions </h2>
Saturated applications of the constructor worker function MkT{v} are
treated specially by the code generator; they really do allocation.
However, we do want a single, shared, top-level definition for
top-level nullary constructors (like True and False). Furthermore,
what if the code generator encounters a non-saturated application of a
worker? E.g. <tt>(map Just xs)</tt>. We could declare that to be an
error (CorePrep should saturate them). But instead we currently
generate a top-level defintion for each constructor worker, whether
nullary or not. It takes the form:
MkT{v} = \ p q r -> MkT{v} p q r
This is a real hack. The occurrence on the RHS is saturated, so the code generator (both the
one that generates abstract C and the byte-code generator) treats it as a special case and
allocates a MkT; it does not make a recursive call! So now there's a top-level curried
version of the worker which is available to anyone who wants it.
This strange defintion is not emitted into External Core. Indeed, you might argue that
we should instead pass the list of <tt>TyCon</tt>s to the code generator and have it
generate magic bindings directly. As it stands, it's a real hack: see the code in
<h2> External Core </h2>
When emitting External Core, we should see this for our running example:
data T a = MkT a a Int# | Nil{d}
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of
(a,b) -> seq t (MkT a b t)
f x = case x of
Nil -> Nil
MkT a b r -> MkT a b (r +# 1#)
Notice that it makes perfect sense as a program all by itself. Constructors
look like constructors (albeit not identical to the original Haskell ones).
When reading in External Core, the parser is careful to read it back in just
as it was before it was spat out, namely:
data T a = MkT{d} a a Int# | Nil{d}
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of
(a,b) -> seq t (MkT{v} a b t)
f x = case x of
Nil{d} -> Nil{v}
MkT{d} a b r -> MkT{v} a b (r +# 1#)
<h2> Unboxing strict fields </h2>
If GHC unboxes strict fields (as in the first argument of <tt>MkT</tt> above),
it also transforms
......@@ -82,13 +208,8 @@ source-language case expressions. Suppose you write this in your Haskell source
GHC will desugar this to the following Core code:
case e of
$wMkT a b t -> let p = (a,b) in ..p..t..
MkT a b t -> let p = (a,b) in ..p..t..
(<em>Important note</em>: perhaps misleadingly, when printing Core we
actually print the constructor in the case expression as
"<tt>MkT</tt>" not as "<tt>$wMkT</tt>", but it really means the
The local let-binding reboxes the pair because it may be mentioned in
the case alternative. This may well be a bad idea, which is why
<tt>-funbox-strict-fields</tt> is an experimental feature.
......@@ -8,13 +8,76 @@
<h1>The GHC Commentary - Primitives and the Prelude</h1>
One of the trickiest aspects of GHC is the delicate interplay
between what knowledge is baked into the compiler, and what
knowledge it gets by reading the interface files of library
modules. In general, the less that is baked in, the better.
Most of what the compiler has to have wired in about primitives and
prelude definitions is in
GHC recognises these main classes of baked-in-ness:
<dt><strong>Primitive types.</strong>
<dd>Primitive types cannot be defined in Haskell, and are utterly baked into the compiler.
They are notionally defined in the fictional module <tt>GHC.Prim</tt>. The <tt>TyCon<tt>s for these types are all defined
in module <tt>TysPrim</tt>; for example,
intPrimTyCon :: TyCon
intPrimTyCon = ....
<tt>Int#, Float#, Addr#, State#</tt>.
<dt><strong>Wired-in types.</strong>
<dd>Wired-in types can be defined in Haskell, and indeed are (many are defined in </tt>GHC.Base</tt>).
However, it's very convenient for GHC to be able to use the type constructor for (say) <tt>Int</tt>
without looking it up in any environment. So module <tt>TysWiredIn</tt> contains many definitions
like this one:
intTyCon :: TyCon
intTyCon = ....
intDataCon :: DataCon
intDataCon = ....
However, since a <tt>TyCon</tt> value contains the entire type definition inside it, it follows
that the complete definition of <tt>Int</tt> is thereby baked into the compiler.
Nevertheless, the library module <tt>GHC.Base</tt> still contains a definition for <tt>Int</tt>
just so that its info table etc get generated somewhere. Chaos will result if the wired-in definition
in <tt>TysWiredIn</tt> differs from that in <tt>GHC.Base</tt>.
The rule is that only very simple types should be wired in (for example, <tt>Ratio</tt> is not,
and <tt>IO</tt> is certainly not). No class is wired in: classes are just too complicated.
Examples: <tt>Int</tt>, <tt>Float</tt>, <tt>List</tt>, tuples.
<dt><strong>Known-key things.</strong>
<dd>GHC knows of the existence of many, many other types, classes and values. <em>But all it knows is
their <tt>Name</tt>.</em> Remember, a <tt>Name</tt> includes a unique key that identifies the
thing, plus its defining module and occurrence name
(see <a href="names.html">The truth about Names</a>). Knowing a <tt>Name</tt>, therefore, GHC can
run off to the interface file for the module and find out everything else it might need.
Most of these known-key names are defined in module <tt>PrelNames</tt>; a further swathe concerning
Template Haskell are defined in <tt>DsMeta</tt>. The allocation of unique keys is done manually;
chaotic things happen if you make a mistake here, which is why they are all together.
All the <tt>Name</tt>s from all the above categories are used to initialise the global name cache,
which maps (module,occurrence-name) pairs to the globally-unique <tt>Name</tt> for that
thing. (See <tt>HscMain.initOrigNames</tt>.)
The next sections elaborate these three classes a bit.
<h2>Primitives (module <tt>TysPrim</tt>)</h2>
Some types and functions have to be hardwired into the compiler as they
are atomic; all other code is essentially built around this primitive
......@@ -51,7 +114,7 @@
TyCon</code> converts <code>PrimRep</code> values into the corresponding
type constructor.
<h2>The Prelude</h2>
<h2>Wired in types (module <tt>TysWiredIn</tt>)</h2>
In addition to entities that are primitive, as the compiler has to treat
them specially in the backend, there is a set of types, functions,
......@@ -84,6 +147,9 @@
as <code>mkListTy</code> and <code>mkTupleTy</code>, which construct
compound types.
<h2>Known-key names (module <tt>PrelNames</tt>)</h2>
All names of types, functions, etc. known to the compiler are defined in
......@@ -123,6 +189,8 @@ floatPrimTyConKey = mkPreludeTyConUnique 11</pre>
the parser, such as [], and code generated from deriving clauses), which
will take care of adding uniqueness information.
<h2>Gathering it all together (module <tt>PrelInfo</tt>)</h2>
The module
<a href=""><code>PrelInfo</code></a>
in some sense ties all the above together and provides a reasonably
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment