Representation polymorphic datatypes
GHC's representation polymorphism (the ability to vary the blah
in a TYPE blah
) is quite restrictive: variables must have a known representation, arguments must have a known representation, and fields in a datatype must have a known representation. This ticket is about a way to lift that last restriction. This would allow e.g. [3#, 4#, 5#] :: [Int#]
and (3#, 3.14##, "hello")
.
The key challenge in representation-polymorphic data is that every data constructor is associated with an info table, a small structure describing the contents of a memory block tagged with that constructor. This info table is critical to the garbage collector, which must know which slots of memory have pointers in them. If a cons cell holds a normal Type
, the cell contains two pointers. If it holds an Int#
, though, the first word is not a pointer and must not be traversed by the garbage collector. Accordingly, in order to support representation polymorphism in data types, we must have a way of having multiple different info tables associated with a single data constructor.
To first approximation, then, that would be the solution: at every use of a constructor, the type checker would have to figure out what representation to specialize it at (possibly involving multiple variables, such as for tuples). Then, the actual memory cell allocated would point to the info table appropriate for the particular specialization of the data constructor.
The challenge is that there is actually an unbounded number of different representations (because of unboxed tuples and sums), and we don't want to create a fresh info table each time. So, upon creation of a data-constructor cell, we consult a master info table table that lists all the info tables that have been created thus far. This info table table would be keyed by the particular choice of representations, and it would store a pointer to the regular info table to be pointed to. Upon construction, we look up in the info table table to get the info table pointer. If the particular specialization at hand hasn't been used yet, we dynamically create the right info table.
But that's too costly: it means extra runtime steps at every use of a data constructor. This is unacceptable. So we acknowledge that Type
is the most common representation; for every data constructor, we pre-allocate (at program startup time, just as now) info tables for Type
. Then, when a data constructor is specialized to Type
(as most will be most of the time), we just point straight to the pre-allocated info table (just as now). No extra steps. So the extra work is only when the data constructor is specialized to non-Type
. We might imagine further optimizations that do some analysis to see other info tables that should be pre-allocated, but that's not necessary in the first version of this idea.
One remaining unknown is when to deallocate the info tables -- but that's not so bad, actually: we just deallocate the info tables when there are no relevant data constructor cells in memory that point to them, achievable by a normal mark-and-sweep process. (There would need to be an extra step to clear out the entry in the info table table.)
I don't plan on working on this anytime soon, but this all spilled out of a conversation @simonpj and I had, and it seems considerably easier to implement than full representation-polymorphism, and so I thought I'd write it up. If this is of interest to you (and e.g. you might be able to implement), let's talk.