TTG: the language-haskell package
This ticket describes the plan and follows the implementation of separating the Haskell AST and parsing stage into the packages haskell-syntax and haskell-parser, to achieve modularity and code reuse across clients.
We propose the following goals to reach this package separation:
1. A module hierarchy Language.Haskell.Syntax which contains the base Haskell AST. Just the data types, very few functions, since there is virtually nothing you can do without knowing any of the type family instances.
2. A module hierarchy Language.Haskell.Parser that defines HsExpr Parsed, and has the properties enumerated below. Presumably this package buys into the full Anno infrastructure to decorate the parse tree.
3. Rejig GHC to use HsExpr Parsed rather than HsExpr (GhcPass Parsed).
4. Move the first two from ghc to a ghc-haskell-syntax package.
5. Figure out what needs to be done to create a ghc-haskell-parser package, later.
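As a rough illustration of how the first three goals could fit together, here is a minimal sketch on a toy expression type. Everything in it (Expr, XVar, XApp, XXExpr, Span, spanOf, countVars, and the reduced two-pass GhcPass) is made up for illustration and is not the actual GHC definition; the comments mark which package each part would plausibly belong to.

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies   #-}

import Data.Void (Void, absurd)

-- Would live under Language.Haskell.Syntax (assumption): only data types
-- and extension families, with no instances.
type family XVar p
type family XApp p
type family XXExpr p

data Expr p
  = Var (XVar p) String
  | App (XApp p) (Expr p) (Expr p)
  | XExpr (XXExpr p)              -- extension constructor

-- Would live under Language.Haskell.Parser (assumption): a
-- client-independent phase index plus the decorations the parser attaches.
data Parsed

data Span = Span Int Int          -- toy stand-in for annotation data
  deriving (Eq, Show)

type instance XVar   Parsed = Span
type instance XApp   Parsed = Span
type instance XXExpr Parsed = Void   -- the parser adds no constructors

-- Would stay in GHC (assumption): GhcPass only covers the later passes.
data Pass = Renamed | Typechecked
data GhcPass (c :: Pass)

type instance XVar   (GhcPass c) = ()
type instance XApp   (GhcPass c) = ()
type instance XXExpr (GhcPass c) = Void

-- Phase-independent: usable with no knowledge of any instance.
countVars :: Expr p -> Int
countVars (Var _ _)   = 1
countVars (App _ f x) = countVars f + countVars x
countVars (XExpr _)   = 0

-- Phase-specific: only typechecks because the Parsed instances are known.
spanOf :: Expr Parsed -> Span
spanOf (Var s _)   = s
spanOf (App s _ _) = s
spanOf (XExpr v)   = absurd v
```

Note that countVars needs no instances at all, while spanOf depends on the Parsed instances being in scope, which is why the base hierarchy can contain very few functions.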
Properties:
- A (pre-processed) string can be parsed into HsExpr Parsed. Let's call this operation parse. (By "pre-processed", we mean that there is no CPP left in the input string.)
- HsExpr Parsed can be pretty-printed to a string. Let's call this operation print.
- parse and print are exact inverses, in both directions.
- HsExpr Parsed is a suitable data structure to use as the input to a compilation pipeline. It is structured and recursive.
- HsExpr Parsed depends on a minimum of infrastructure. In particular, no aspects of e.g. type-checking or code generation should be depended on by any module defining the HsExpr type, nor the parse or print functions.
- HsExpr is extensible, fitting with the TTG framework for extension in other phases; the unextended HsExpr contains the essence of the Haskell AST, with the parts that are useful to many syntax processors.
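To make the round-trip property concrete, here is a small sketch on a toy language of identifiers, application and parentheses. The names Expr, Ws, parseExpr and printExpr are hypothetical, and the whitespace fields stand in for what would be parser-specific annotations in the real tree. Under these assumptions, printing a parsed string gives back the string exactly; the converse is only claimed for trees the parser can actually produce.

```haskell
import Data.Char (isAlpha, isAlphaNum, isSpace)

-- Whitespace that preceded a token in the source.
type Ws = String

-- Toy AST (hypothetical): every character of the input ends up in the tree.
data Expr
  = Var Ws String        -- leading whitespace, identifier
  | App Expr Expr        -- juxtaposition, left-associative
  | Par Ws Expr Ws       -- whitespace before "(" and before ")"
  deriving (Eq, Show)

-- Parse one atom: an identifier or a parenthesised expression.
parseAtom :: String -> Maybe (Expr, String)
parseAtom s =
  case rest of
    '(' : r -> do
      (e, r') <- parseApp r
      let (ws2, r'') = span isSpace r'
      case r'' of
        ')' : r3 -> Just (Par ws e ws2, r3)
        _        -> Nothing
    c : _ | isAlpha c ->
      let (name, r) = span isAlphaNum rest
      in  Just (Var ws name, r)
    _ -> Nothing
  where
    (ws, rest) = span isSpace s

-- Parse a chain of applications: atom atom atom ...
parseApp :: String -> Maybe (Expr, String)
parseApp s = do
  (e, r) <- parseAtom s
  go e r
  where
    go f r = case parseAtom r of
      Just (x, r') -> go (App f x) r'
      Nothing      -> Just (f, r)

-- Succeeds only if the whole input is consumed.
parseExpr :: String -> Maybe Expr
parseExpr s = case parseApp s of
  Just (e, "") -> Just e
  _            -> Nothing

-- Exact printer: replays the stored whitespace verbatim.
printExpr :: Expr -> String
printExpr (Var ws n)     = ws ++ n
printExpr (App f x)      = printExpr f ++ printExpr x
printExpr (Par ws e ws2) = ws ++ "(" ++ printExpr e ++ ws2 ++ ")"
```

The sketch also shows why "both directions" is the harder half: a hand-built tree such as App f (App g x) prints as a string that parses back to a differently nested tree, so the AST has to be constrained (or the printer has to insert parentheses) before print followed by parse is the identity on every tree.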
For (1) we need to remove all dependencies on GHC from all L.H.S modules. This is well underway: move HsModule (!8228, closed), move MatchGroup Origin field (!8301, merged), TTG for splices (!7821, closed), removing other dependencies (!8308, closed), dealing with FastString (#21628). We need to keep reviewing these and enforcing one criterion: the base AST should represent the complexity of the source code, but doesn't need to keep e.g. annotations; those are a Parser extension to the AST.
For (2) we should start moving the parser modules to Language.Haskell.Parser. I don't think we should do this yet; (1) must be complete first.
For (3) we need (2).
For (4), we can have it before (3), but need (1,2).
Let's keep tuning this description with the plan.
For history, here is the original ticket description:
This issue follows a previous discussion in a TTG related MR.
I want to further this discussion, in particular because of some design decisions concerning my MR !8228 (closed) (see below).
Quoting @rae in !4782 (comment 415958):
> I want to add a small new thought: What if the result of parsing weren't GhcPs, but Parsed. That is, we have a client-independent phase indicator Parsed :: Type that lives in the client-independent space. GhcPass would just encompass GhcRn and GhcTc. For me, the fact that you can parse into and pretty-print from an AST is its fundamental property, and a great part to start any design discussion.
> One challenge is that, to really have the parse/pretty-print property, you need information about whitespace, comments, keyword spellings, unicode syntax, and such. This information is mostly useless after parsing. Since our TTG structure offers no way to remove fields, putting this information in the base tree might make it less usable downstream. That might be why we use extension points for this data today.
> So maybe I advocate for a compromise position: put Parser as a general, client-independent pass. All the source annotation stuff would also live in the client-independent space. But design the AST so that there are extensions for Parser, around annotations it is unlikely for clients to need. It would be a judgment call around what information is likely useless to clients (for example, one could argue that the distinction between (x `f`) and f x is "useless"), but I think we are likely to make mostly good judgments around this.
I'm currently working on moving HsModule to L.H.S in !8228 (closed), and when dealing with GHC.Hs.ImpExp, I arrived at this:
-- | Imported or exported entity.
data IE pass
  = IEVar (XIEVar pass) (LIEWrappedName pass (IdP pass))
  ...
  | IEGroup (XIEGroup pass) Int (LHsDoc pass) -- ^ Doc section heading
  | IEDoc (XIEDoc pass) (LHsDoc pass) -- ^ Some documentation
  | IEDocNamed (XIEDocNamed pass) String -- ^ Reference to named doc
The first reason why I brought this discussion up again is HsDoc. I was trying to understand whether IEGroup and IEDoc should be moved to the GHC-specific part, or whether HsDoc should be moved to the client-independent stage.
In light of the comment advocating the independent Parsed stage, which is also separate from the independent base AST, I would say that HsDoc would be a good example of something to be put in the independent Parsed stage, but not in the base AST.
For now I'll move the constructors depending on HsDoc to the GHC-specific part, and if we later on pick up the Parsed idea, we can move them there.
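The interim choice described above can be sketched with a TTG extension constructor: the HsDoc-dependent constructors leave the base IE type and reappear in a GHC-side type plugged in through the extension point. This is a simplified assumption of how it could look, not the code in !8228: names are reduced to String, located wrappers are dropped, HsDoc is a plain newtype, and GhcIEExt and exportDocs are hypothetical names.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Base AST side (simplified): no mention of HsDoc.
type family XIEVar pass
type family XIEDocNamed pass
type family XXIE pass

data IE pass
  = IEVar (XIEVar pass) String            -- name simplified to String
  | IEDocNamed (XIEDocNamed pass) String  -- does not depend on HsDoc, so it stays
  | XIE (XXIE pass)                       -- extension constructor

-- GHC-specific side (hypothetical layout).
newtype HsDoc = HsDoc String
  deriving (Eq, Show)

data GhcPs

-- The constructors that depend on HsDoc, moved out of the base type.
data GhcIEExt
  = IEGroup Int HsDoc   -- doc section heading
  | IEDoc HsDoc         -- some documentation

type instance XIEVar      GhcPs = ()
type instance XIEDocNamed GhcPs = ()
type instance XXIE        GhcPs = GhcIEExt

-- GHC-side code can still reach the moved constructors through XIE.
exportDocs :: [IE GhcPs] -> [HsDoc]
exportDocs ies = [d | XIE ext <- ies, d <- docOf ext]
  where
    docOf (IEGroup _ d) = [d]
    docOf (IEDoc d)     = [d]
```

If the Parsed idea is picked up later, the same two constructors could move by changing which pass supplies the XXIE instance, without touching the base IE type.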
I mixed in a bit of !8228 (closed) talk here, but only to motivate further discussion. For comments regarding this specific TTG Module x HsDoc discussion, use the linked MR.
Thank you, ~romes