Skip to content

Grammar for types and data/newtype constructors (#18506)

Vladislav Zavialov requested to merge wip/disamb-td into master

Before this patch, we parsed types into a reversed sequence of operators and operands. For example, (F x y + G a b * X) would be parsed as [X, *, b, a, G, +, y, x, F], using a simple grammar:

tyapps
  : tyapp
  | tyapps tyapp

tyapp
  : atype
  | PREFIX_AT atype
  | tyop
  | unpackedness

Then we used a hand-written state machine to assemble this either into a type, using mergeOps, or into a constructor, using mergeDataCon.

This is due to a syntactic ambiguity:

data T1 a =          MkT1 a
data T2 a = Ord a => MkT2 a

In T1, what follows after the = sign is a data/newtype constructor declaration. However, in T2, what follows is a type (of kind Constraint). We don't know which of the two we are parsing until we encounter =>, and we cannot check for => without unlimited lookahead.

This poses a few issues when it comes to e.g. infix operators:

data I1 = Int :+ Bool :+ Char          -- bad
data I2 = Int :+ Bool :+ Char => MkI2  -- fine

By this issue alone we are forced into parsing into an intermediate representation and doing a separate validation pass.

However, should that intermediate representation be as low-level as a flat sequence of operators and operands?

Before GHC Proposal #229, the answer was Yes, due to some particularly nasty corner cases:

data T = ! A :+ ! B          -- used to be fine, hard to parse
data T = ! A :+ ! B => MkT   -- bad

However, now the answer is No, as this corner case is gone:

data T = ! A :+ ! B          -- bad
data T = ! A :+ ! B => MkT   -- bad

This means we can write a proper grammar for types, overloading it in the DisambECP style, see Note [Ambiguous syntactic categories].

With this patch, we introduce a new class, DisambTD. Just like DisambECP is used to disambiguate between expressions, commands, and patterns, DisambTD is used to disambiguate between types and data/newtype constructors.

This way, we get a proper, declarative grammar for constructors and types:

infixtype
  : ftype
  | ftype tyop infixtype
  | unpackedness infixtype

ftype
  : atype
  | tyop
  | ftype tyarg
  | ftype PREFIX_AT tyarg

tyarg
  : atype
  | unpackedness atype

And having a grammar for types means we are a step closer to using a single grammar for types and expressions.

Closes #18506 (closed).

Edited by Vladislav Zavialov

Merge request reports