Where should "tokens" live in the abstract syntax tree?

Language.Haskell.Syntax is a compiler-independent data type for the Haskell abstract syntax tree. It is designed to be extensible using Trees that Grow.

The question that this ticket addresses is where should we store information about the precise position of the keywords and punctuation of the program?.

Progress:

!11716 (closed): move tokens for HsLet into the extension field, and EpAnn stuff into the <xrec-stuff> field
!11756 (closed): move tokens into GhcPs extension fields

There has been some discussion in the past:

The "API annotations" wiki page
#19623 (closed)
#22558 (closed)
MRs in flight: !9476 (closed) !9477 (closed)
ghc-devs discussion thread (July 23)

Tokens

We use the term tokens for the "keywords and punctuation". We already have the type HsToken defined in Language.Haskell.Syntax, defined as follows:

type LHsToken tok p = XRec p (HsToken tok)
data HsToken (tok :: Symbol) = HsTok

So LHsToken p "wombat" represents the keyword wombat, with the "wombat" in the type giving some helpful documentation. The main payload is the XRec part which allows a client to record the location of the token.

Motivation

Why do we want to store those tokens in the syntax tree at all? Use cases:

Refactoring tools could parse the source program, modify a small part of it, and print it back into the source file. The formatting of unmodified parts should be preserved, so we need the locations of every token (that's called "exactprinting").
Haddock needs to associate documentation comments with AST nodes. Doing so in the parser is very difficult, so we just accumulate the comments in a list and insert them back into the tree in a separate pass. We need token location information to do this.

In other cases, those tokens are an annoyance:

template-haskell, as well as any other GHC API client that generates ASTs, doesn't have token locations and has to fill them with noHsTok.
The renamer, the type checker, and the desugarer have no use for those tokens. Passing them around is a distraction from the actual renaming/type-checking/desugaring logic.

Possible Approaches

There are two general approaches:

Token Plan A. Tokens are not part of the abstract syntax tree, and do not belong in Language.Haskell.Syntax at all. If you want to store that stuff, do it in an extension field.
Token Plan B. It is often helpful to be able to reproduce precisely what the programmer wrote (so called "exact-print"). That means knowing precisely where the keywords and punctuation were. Rather than duplicating this rendering/pretty-printing code separately for each tool, it would be nice to do it once, in Language.Haskell.Syntax

One might argue that this makes our AST less abstract, so it’s actually a concrete syntax tree. But Language.Haskell.Syntax already retain some information uncharacteristic of a proper AST, such as parentheses (with HsPar), so adding token information is arguably appropriate.

Currently in GHC HEAD we have mainly Plan A, with a spinkling of Plan B. For example:

data HsExpr p
  = ...
  | HsPar       (XPar p)
               !(LHsToken "(" p)
                (LHsExpr p)  -- ^ Parenthesised expr; see Note [Parens in HsSyn]
               !(LHsToken ")" p)

But we have no clear decision or plan. Hence this ticket.

Details about Plan A

To put the token information in the extension fields, a client of Language.Hasekll.Syntax would do something like this. Here is the declaration of HsExpr:

  data HsExpr p
    = ...
    | HsLet       (XLet p)
                  (LHsLocalBinds p)
                  (LHsExpr  p)

The question is then downstream API users should consume these annotations while still being able to extend the AST themselves. I think this can be accomplished by introducing a new pass transformer, WithExact:

-- | A pass @p@ augmented with information necessary for exact-printing.
data WithExact p

We can then introduce the appropriate type family instances to capture tokens as necessary. For instance, let might look like:

type instance XLet (WithExact p) = (XLet p, (LHsToken p "let", LHsToken p "in"))

The various GHC passes would then be defined as:

type GhcPs = WithExact (GhcPass 'Parsed)
type GhcRn = WithExact (GhcPass 'Renamed)
type GhcTc = WithExact (GhcPass 'Typechecked)

Now the extension field is always a pair, of the previous XLet p information, and a tuple of tokens. If there are no tokens for a constructor K one could say

type instance XK (WithExact p) = XK p

Alternatively one could use a data type with named fields:

type instance XLet (WithExact p) = ExactLet p
data ExactLet p = ExactLet { exactLetLet :: !(LHsToken p "let")
                           , exactLetIn  :: !(LHsToken p "in")
                           , exactLetX   :: XLet p
                           }

but that seems overkill: the tokens are already self-documenting.

Details about Plan B

In Plan B we directly put the tokens in the tree, not in extension fields. We can do so using two different stles:

Token Plan B1: keep the tokens together in a tuple.
Token Plan BN: spread the tokens across the data constructor in suggestive places.

For example, for HsLet here is what Plan B1 looks like:

  data HsExpr p
    = ...
      ...
    | HsLet       (XLet p)
+                 (LHsToken p "let", LHsToken p "in")
                  (LHsLocalBinds p)
                  (LHsExpr  p)

And here is the same for Plan BN:

  data HsExpr p
    = ...
      ...
    | HsLet       (XLet p)
+                 (LHsToken p "let")
                  (LHsLocalBinds p)
+                 (LHsToken p "in")
                  (LHsExpr  p)

Comparing plans

Plan A advantages:

Clients can completely ignore all the exact-print stuff. With Plan B they have to handle those fields, if only to pass them on. With Plan B1 that is not too bad (one field), but it's pretty tiresome with Plan BN.
Runtime: Plan A doesn't have to pay for exact-print information if it doesn't use it. Plan B allocates more: every data constructor gets more fields, and each pass needs to copy those fields into a new copy of the construtor. Plan B1 is better than Plan BN in this respect.
Generated code: some clients (such as GHC) generate HsSyn, e.g. by desugaring source. For this generated code, the location of the tokens makes no sense. Plan A does not force programmers to invent fake tokens; Plan B does.

Plan B advantages:

The Big Adantage is to be able to write a single, client-independent exact-print pretty-printer.
The data type declaration for Plan BN looks quite perspicuous: the tokens appear in the data type interspersed with the non-token arguments, just as in the concrete syntax.
When a GHC pass uses the extension field, it doesn't need to worry about pairing it up with the exact-print information.

Missing information

The main benefit of Plan B is that we can make a single exact-print implementation, in Language.Haskell.Syntax. But that means more than putting LHsToken in the tree: it means that exact-print has to be able to get SrcSpans out of XRec. How does it do that? We need to see that design; otherwise we don't know if we'll get the payoff.

Edited Dec 12, 2023 by Vladislav Zavialov

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information