
TTG: the language-haskell package

This ticket describes the plan for, and tracks the implementation of, separating the Haskell AST and the parsing stage into the packages haskell-syntax and haskell-parser, to achieve modularity and code reuse across clients.

We propose the following goals to reach this package separation:

  1. A module hierarchy Language.Haskell.Syntax which contains the base Haskell AST. Just the data types, very few functions, since there is virtually nothing you can do without knowing any of the type family instances.
  2. A module hierarchy Language.Haskell.Parser that defines HsExpr Parsed, and has the properties enumerated below. Presumably this package buys into the full Anno infrastructure to decorate the parse tree.
  3. Rejig GHC to use HsExpr Parsed rather than HsExpr (GhcPass Parsed).
  4. Move the first two from ghc to a ghc-haskell-syntax package.
  5. Figure out what needs to be done to create a ghc-haskell-parser package, later.
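
To make goals 1 to 3 concrete, here is a minimal sketch of how the indices could be laid out. It is not GHC's actual definition: the two-constructor HsExpr, the String and Int stand-ins for RdrName and Name, and the reduced Pass kind are all assumptions made for illustration.

```haskell
{-# LANGUAGE DataKinds, KindSignatures, TypeFamilies #-}

import Data.Kind (Type)

-- Base AST (would live under Language.Haskell.Syntax):
-- data types and open type families only, no instances.
type family XVar p :: Type
type family XApp p :: Type
type family IdP  p :: Type

data HsExpr p
  = HsVar (XVar p) (IdP p)
  | HsApp (XApp p) (HsExpr p) (HsExpr p)

-- Client-independent phase index (would live under Language.Haskell.Parser).
data Parsed

type instance XVar Parsed = ()
type instance XApp Parsed = ()
type instance IdP  Parsed = String   -- hypothetical stand-in for RdrName

-- GHC-specific index, now covering only the later passes.
data Pass = Renamed | Typechecked
data GhcPass (c :: Pass)

type instance XVar (GhcPass c) = ()
type instance XApp (GhcPass c) = ()
type instance IdP  (GhcPass 'Renamed)     = Int   -- hypothetical stand-in for Name
type instance IdP  (GhcPass 'Typechecked) = Int   -- hypothetical stand-in for Id

-- One of the "very few functions" the base hierarchy can offer:
-- it never inspects an extension field, so it needs no instances.
exprSize :: HsExpr p -> Int
exprSize (HsVar _ _)   = 1
exprSize (HsApp _ f x) = 1 + exprSize f + exprSize x
```

With this layout, GHC would consume HsExpr Parsed directly, and HsExpr (GhcPass p) would only appear from the renamer onwards.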

Properties:

  1. A (pre-processed) string can be parsed into HsExpr Parsed. Let's call this operation parse. (By "pre-processed", we mean that there is no CPP left in the input string.)
  2. HsExpr Parsed can be pretty-printed to a string. Let's call this operation print.
  3. parse and print are exact inverses, in both directions.
  4. HsExpr Parsed is a suitable data structure to use as the input to a compilation pipeline. It is structured and recursive.
  5. HsExpr Parsed depends on a minimum of infrastructure. In particular, no aspects of e.g. type-checking or code generation should be depended on by any module defining the HsExpr type, nor the parse or print functions.
  6. HsExpr is extensible, fitting with the TTG framework for extension in other phases; the unextended HsExpr contains the essence of the Haskell AST, with the parts that are useful to many syntax processors.
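
Properties 1 to 3 can be illustrated on a toy language. The sketch below is not the GHC parser: Expr, Ws, parseExpr and printExpr are hypothetical names, and the language only has identifiers and juxtaposition. The point is that the tree has to store the exact whitespace for print to undo parse. In this toy version, print after parse is the identity on every accepted string, and parse after print is the identity on every tree that parse produced.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Exact whitespace that preceded a token, kept so printing can reproduce it.
newtype Ws = Ws String
  deriving (Eq, Show)

data Expr
  = Var Ws String     -- leading whitespace, then the identifier
  | App Expr Expr     -- juxtaposition, left-nested by the parser
  deriving (Eq, Show)

-- print: concatenate tokens together with the whitespace stored in the tree.
printExpr :: Expr -> String
printExpr (Var (Ws w) x) = w ++ x
printExpr (App f a)      = printExpr f ++ printExpr a

-- parse: one or more identifiers; trailing whitespace and
-- non-alphanumeric characters are rejected.
parseExpr :: String -> Maybe Expr
parseExpr s = case atoms s of
  Just (e : es) -> Just (foldl App e es)
  _             -> Nothing

-- Split the input into identifiers, each remembering its leading whitespace.
atoms :: String -> Maybe [Expr]
atoms "" = Just []
atoms s =
  let (w, rest)  = span isSpace s
      (x, rest') = span isAlphaNum rest
  in if null x
       then Nothing
       else (Var (Ws w) x :) <$> atoms rest'
```

For example, parsing "f  x y" and printing the result gives back "f  x y", double space included.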

For (1) we need to remove all dependencies on GHC from all L.H.S modules. This is well underway: move HsModule (!8228, closed), move the MatchGroup Origin field (!8301, merged), TTG for splices (!7821, closed), removing other dependencies (!8308, closed), and dealing with FastString (#21628). We need to keep reviewing these and enforcing one criterion: the base AST should represent the complexity of the source code, but doesn't need to keep e.g. annotations; those are a Parser extension to the AST.

For (2) we should start moving the parser modules to Language.Haskell.Parser. I don't think we should do this yet; (1) must be complete first.

For (3) we need (2).

For (4), we can have it before (3), but need (1,2).

Let's keep tuning this description with the plan.


For history, here is the original ticket description:

This issue follows a previous discussion in a TTG related MR.

I want to further this discussion, in particular because of some design decisions concerning my MR !8228 (closed) (see below)

Quoting @rae in !4782 (comment 415958):

I want to add a small new thought: What if the result of parsing weren't GhcPs, but Parsed. That is, we have a client-independent phase indicator Parsed :: Type that lives in the client-independent space. GhcPass would just encompass GhcRn and GhcTc. For me, the fact that you can parse into and pretty-print from an AST is its fundamental property, and a great part to start any design discussion.

One challenge is that, to really have the parse/pretty-print property, you need information about whitespace, comments, keyword spellings, unicode syntax, and such. This information is mostly useless after parsing. Since our TTG structure offers no way to remove fields, putting this information in the base tree might make it less usable downstream. That might be why we use extension points for this data today.

So maybe I advocate for a compromise position: put Parser as a general, client-independent pass. All the source annotation stuff would also live in the client-independent space. But design the AST so that there are extensions for Parser, around annotations it is unlikely for clients to need. It would be a judgment call around what information is likely useless to clients (for example, one could argue that the distinction between (x f) and f x is "useless"), but I think we are likely to make mostly good judgments around this.
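
(End of quote.) As a rough sketch of what that compromise could look like: the base tree only exposes an extension field, the Parsed instance fills it with what exact printing needs, and a downstream client instantiates it to (). Everything here is hypothetical (Expr, LamAnn and its fields, the Downstream index); it is only meant to show the shape.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Base tree: one extension field on Lam, no annotation types mentioned.
type family XLam p :: Type

data Expr p
  = Var String
  | Lam (XLam p) String (Expr p)

-- Parser-only information, needed for exact printing and little else.
data LamAnn = LamAnn
  { unicodeArrow    :: Bool    -- was the arrow spelled with U+2192 rather than "->"?
  , spaceAfterArrow :: String  -- exact whitespace that followed the arrow
  }

data Parsed       -- client-independent parser pass
data Downstream   -- hypothetical client pass that ignores layout

type instance XLam Parsed     = LamAnn
type instance XLam Downstream = ()

-- The printer is the one consumer that reads the annotation.
render :: Expr Parsed -> String
render (Var x) = x
render (Lam ann x body) =
  "\\" ++ x ++ " " ++ arrow ++ spaceAfterArrow ann ++ render body
  where
    arrow = if unicodeArrow ann then "\8594" else "->"

-- Moving to another pass replaces the annotation with ().
forget :: Expr Parsed -> Expr Downstream
forget (Var x)        = Var x
forget (Lam _ x body) = Lam () x (forget body)

-- Pass-independent code never looks at the extension field.
binders :: Expr p -> [String]
binders (Var _)        = []
binders (Lam _ x body) = x : binders body
```

The judgment call from the quote then becomes the question of which fields go into LamAnn and which go directly into the constructor.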

I'm currently working on moving HsModule to L.H.S in !8228 (closed), and when dealing with GHC.Hs.ImpExp, I arrived at this:

-- | Imported or exported entity.
data IE pass
  = IEVar       (XIEVar pass) (LIEWrappedName pass (IdP pass))
  ...
  | IEGroup             (XIEGroup pass) Int (LHsDoc pass) -- ^ Doc section heading
  | IEDoc               (XIEDoc pass) (LHsDoc pass)       -- ^ Some documentation
  | IEDocNamed          (XIEDocNamed pass) String    -- ^ Reference to named doc

The first reason I am bringing this discussion up again is HsDoc. I was trying to understand whether IEGroup and IEDoc should be moved to the GHC-specific part, or whether HsDoc should be moved to the client-independent stage.

In light of the comment advocating a client-independent Parsed stage that is separate from the client-independent base AST, I would say that HsDoc is a good example of something to put in the Parsed stage, but not in the base AST.

For now I'll move the constructors depending on HsDoc to the GHC-specific part, and if we later on pick up the Parsed idea, we can move them there.
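
Roughly, the move could look like the sketch below, using the TTG extension constructor. This is a simplification, not the code in the MR: the String stand-in for the wrapped name, the HsDoc newtype, the GhcIEDoc type and the empty GhcPs index are all assumptions.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

type family XIEVar      p :: Type
type family XIEDocNamed p :: Type
type family XXIE        p :: Type

-- Base AST: no constructor mentions HsDoc any more.
data IE p
  = IEVar      (XIEVar p) String        -- String stands in for the wrapped name
  | IEDocNamed (XIEDocNamed p) String   -- reference to named doc, no HsDoc needed
  | XIE        (XXIE p)                 -- extension constructor

-- GHC-specific part: the constructors that depend on HsDoc live here.
newtype HsDoc = HsDoc String            -- stand-in for GHC's documentation type

data GhcIEDoc
  = IEGroup Int HsDoc                   -- doc section heading
  | IEDoc HsDoc                         -- some documentation

data GhcPs                              -- stand-in for GHC's parser index

type instance XIEVar      GhcPs = ()
type instance XIEDocNamed GhcPs = ()
type instance XXIE        GhcPs = GhcIEDoc

-- GHC code reaches the documentation through the extension constructor.
docText :: IE GhcPs -> Maybe String
docText (XIE (IEGroup _ (HsDoc s))) = Just s
docText (XIE (IEDoc (HsDoc s)))     = Just s
docText _                           = Nothing
```

If the Parsed idea is picked up later, the same two constructors could move to the XXIE instance for Parsed instead.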

I mixed a bit of !8228 (closed) talk in here, but only to motivate further discussion. For comments on the specific TTG Module x HsDoc question, please use the linked MR.

Thank you, ~romes

Edited by Rodrigo Mesquita