ExpPatFrame Parser Refactoring
TheThis page outlines a new design for the expression/pattern parser and the motivation for it.
The Problem
There are places in the grammar where we do not know whether we are parsing an expression or a pattern without infinite lookahead (which we do not have in happy
):
- View patterns:
f (Con a b ) = ... -- 'Con a b' is a pattern
f (Con a b -> x) = ... -- 'Con a b' is an expression
- do-notation:
do { Con a b <- x } -- 'Con a b' is a pattern
do { Con a b } -- 'Con a b' is an expression
- Guards:
x | True <- p && q = ... -- 'True' is a pattern
x | True = ... -- 'True' is an expression
- Top-level value/function declarations (FunBind/PatBind):
f !a -- TH splice
f !a = ... -- function declaration
Until we encounter the =
sign, we don't know if it's a top-level TemplateHaskell splice where !
is an infix operator, or if it's a function declaration where !
is a strictness annotation.
The approach GHC uses is to parse patterns as expressions and rejig later. This turns out to be suboptimal:
-
We can't handle corner cases. For instance, the following function declaration LHS is not a valid expression (see #1087 (closed)):
!a + !b = ...
-
There are points in the pipeline where the representation is awfully incorrect. For instance,
f !a b !c = ...
is first parsed as
(f ! a b) ! c = ...
-
We have to extend HsExpr with pattern-specific constructs:
EAsPat
,EViewPat
,ELazyPat
, etc. It isn't particularly elegant and we don't want such constructors to show up in GHC API.
Backtracking with Parser Combinators
One might think we could avoid this issue by using a backtracking parser and doing something along the lines of try pExpr <|> pPat
. I proposed this in a ghc-devs thread: https://mail.haskell.org/pipermail/ghc-devs/2018-October/016291.html. The situation turned out to be more complex. As there can be patterns inside expressions (e.g. via case
, let
, do
) and expressions inside patterns (e.g. view patterns), naive backtracking would be devastating to performance (asymptotically).
Common Structure
There are common syntactic elements of expressions and patterns (e.g. both of them must have balanced parentheses), and we can capture this common structure in an intermediate data type, ExpPatFrame
:
data ExpPatFrame
= FrameVar RdrName
-- ^ Identifier: Just, map, BS.length
| FrameIPVar HsIPName
-- ^ Implicit parameter: ?x
| FrameOverLabel FastString
-- ^ Overloaded label: #label
| FrameTuple [LTupArgFrame] Boxity
-- ^ Tuple (section): (a,b) (a,b,c) (a,,) (,a,)
| FrameViewPat LExpPatFrame LExpPatFrame
-- ^ View pattern: e -> p
| FrameTySig LExpPatFrame (LHsSigWcType GhcPs)
-- ^ Type signature: x :: ty
| FrameAsPat LExpPatFrame LExpPatFrame
-- ^ As-pattern: x@(D a b)
...
ExpPatFrame
is a union of all syntactic elements between expressions and patterns, so it includes both expression-specific constructs (e.g. overloaded labels) and pattern-specific constructs (e.g. view/as patterns).
As soon as we have parsed far enough to decide whether it is an expression or a pattern, we can convert to HsExpr
or HsPat
accordingly. In the future we might want to extend 'ExpPatFrame' (expression/pattern frame) to ExpPatTyFrame
(expression/pattern/type frame), because with Dependent Haskell (or even smaller features, such as visible dependent quantification in terms) we will have types inside expressions and expressions inside types.
The nice thing about having a dedicated type such as ExpPatFrame
is keeping parsing concerns local to the parser: no matter what hacky intermediate structures we add to ExpPatFrame
, we can keep HsExpr
and HsPat
clean.
Trees that Grow
During the discussion of Phab:D5408, an alternative plan came up: create a new GHC pass, GhcPrePs
, and extend HsExpr GhcPrePs
with pattern-specific and command-specific constructors. Then disambiguate during the conversion from GhcPrePs
to GhcPs
.
The reason this design does not work is that some parts of HsExpr
should be disambiguated much sooner than we parse an expression in its entirety. We have:
data HsExpr p
...
| HsLet ...
(LHsLocalBinds p)
...
data Match p body
= Match {
...
m_pats :: [LPat p],
...
}
data GRHSs p body
= GRHSs {
...
grhssLocalBinds :: LHsLocalBinds p
}
data StmtLR idL idR body
...
| BindStmt
(LPat idL)
...
Imagine we're parsing a HsExpr GhcPrePs
– it will contain LHsLocalBinds GhcPrePs
and LPat GhcPrePs
. Converting them to GhcPs
is extra code and extra runtime – we don't want that. Instead, in ExpPatFrame
we store LHsLocalBinds GhcPs
and LPat GhcPs
in corresponding places. Therefore, ExpPatFrame
does not constitute a proper pass: we pre-parse little fragments that store GhcPs
subtrees and then convert these fragments to HsExpr GhcPs
, HsPat GhcPs
, or HsPat GhcPs
.
Embedding Common Terms
We (simonpj, int-index, goldfire, and I (Shayan)) had an email discussion to overview the design space and the results are documented as the following.
-
The process of parsing includes two subprocesses of parsing ambiguous bits into small in-between datatypes (in linear time), and then resolving ambiguities and converting (in linear time) from the in-between datatypes to the relevant
HsSyn
types ofGhsPs
pass. -
Q) What are the other, besides the ones involved D5408, noticeable ambiguities that could benefit from an in-between datatype?
A)There are [other] ambiguities when parsing data declarations, resolved them with an intermediate data type, too, in
D5180
.
For example the following definition of
TyEl
:
data TyEl = TyElOpr RdrName | TyElOpd (HsType GhcPs)
| TyElTilde | TyElBang
| TyElUnpackedness ([AddAnn], SourceText, SrcUnpackedness)
| TyElDocPrev HsDocString
- I proposed to consider the possibility of defining a common core "pre-expression" where expressions, patterns and commands are made of.
It would possibly help to define the three as a composition of the common core and the bits specific them. Then we can have declarations similar to
data HsTerm trm
= ... -- the common term constructs as in your `ExpPatFrame`
data HsExpr x
= TermE (HsTerm (HsExpr x))
| ... -- the other constructors specific to expressions
data Pat x
= TermP (HsTerm (Pat x))
| ...
... -- same for commands
There are different possible variations to above, but this is essentially to say
HsExpr = Common Core + Other expression-specific constructs
Pat = Common Core + Other pattern-specific constructs
(Notice one can split things up all the way to achieve one datatype-per-constructor, but a couple of years ago we've found it overkilling)
We identified the following pros and cons.
Pros are that there will be fewer constructors, and more importantly, parts that look the same may be pretty-printed (or possibly renamed, desugared, ...) the same.
Cons are that there will be one more matching to go under
TermX
constructor and such nesting has a (possibly negligible) performance impact.
ExpPatFrame
Minimizing We'd like to keep ExpPatFrame
as small as possible. It means that instead of duplicating all of HsExpr
and HsPat
constructors in it, we'd rather embed them directly when unambiguous. For example, patterns cannot contain if
, case
, or do
, so we'd rather have this:
data ExpPatFrame
= ...
| ...
| FrameExpr (HsExpr GhcPs)
than this:
data ExpPatFrame
= ...
| ...
| ...
| FrameIf LExpPatFrame LExpPatFrame LExpPatFrame
-- ^ If-expression: if p then x else y
| FrameMultiIf [LFrameGRHS]
-- ^ Multi-way if-expression: if | p = x \n | q = x
| FrameCase LExpPatFrame [LFrameMatch]
-- ^ Case-expression: case x of { p1 -> e1; p2 -> e2 }
| FrameDo (HsStmtContext Name) [LFrameStmt]
-- ^ Do-expression: do { s1; a <- s2; s3 }
...
data FrameStmt
= ...
| ...
| ...
| FrameBindStmt (LPat GhcPs) LExpPatFrame
-- ^ Binding statement: p <- e
| FrameBodyStmt LExpPatFrame
-- ^ Body statement: e
| FrameLetStmt (LHsLocalBinds GhcPs)
-- ^ Let statement: let p = t
...
Unfortunately, while do
or let
cannot be used in patterns, they can be used in commands, so we end up duplicating most of HsExpr
constructors for the sake of HsCmd
. If not for this, we'd be able to make ExpPatFrame
smaller.
Nevertheless, in the final iteration we will include constructors for unambiguous cases:
data ExpPatFrame
= ...
| ...
| FrameExpr (HsExpr GhcPs) -- unambiguously an expression
| FramePat (HsPat GhcPs) -- unambiguously a pattern
| FrameCommand (HsCmd GhcPs) -- unambiguously a command
Hopefully, this will allow us to remove at least some duplication, even if not much.
Implementation Plan
Initially I (int-index) tried to do the refactoring and fix #1087 (closed) simultaneously, and I have a semi-working proof-of-concept passing (almost all) tests: https://github.com/serokell/ghc/commit/39c5db2d77c96d4b0962f581b60908343edf8624
This prototype also revealed that doing the ExpPatFrame
refactoring allows for better error messages in some cases. Compare:
GHCi, version 8.4.3: http://www.haskell.org/ghc/ :? for help
Prelude> foo# = ()
<interactive>:1:6: error:
parse error on input '='
Perhaps you need a 'let' in a 'do' block?
e.g. 'let x = 5' instead of 'x = 5'
and
GHCi, version 8.7.20181026: http://www.haskell.org/ghc/ :? for help
Prelude> foo # = ()
<interactive>:1:5: error:
Operator is missing a right-hand side argument: #
Prelude> foo# = ()
<interactive>:2:4: error:
Operator is missing a right-hand side argument: #
Perhaps you meant to enable MagicHash?
However, after a few hellish rebases, I decided to split this effort into many smaller patches. Here's the new plan:
- Introduce
ExpPatFrame
with as little churn as possible. - Use it to clean up the definition of
HsExpr
, removingEAsPat
,EWildPat
,EViewPat
,ELazyPat
. Perhaps alsoHsArrApp
andHsArrForm
. - Use it as the basis for fixing #1087 (closed)
- Investigate its viability for term/type parser unification