Expression/command ambiguity resolution
Expression/command ambiguity (arrow syntax)
There are places in the grammar where we do not know whether we are parsing an expression or a command without infinite lookahead:
Compare: proc x -> do { (stuff) -< x } -- 'stuff' is an expression
proc x -> do { (stuff) } -- 'stuff' is a command
Until we encounter arrow syntax -<
we don't know whether to parse (stuff)
as
an expression or a command.
The approach GHC used before was to parse commands as expressions and rejig
later. This is suboptimal because we had to extend HsExpr
with command-specific
constructs: HsArrForm
and HsArrApp
, polluting one of the basic definitions.
A less invasive option is to add these constructors only in the GhcPs
pass.
However, GhcPs
corresponds to the output of parsing, not to its intermediate
results, so we wouldn't want them there either.
Creating a new GhcPrePs
pass would significantly bloat conversion code and slow
down the compiler by adding another linear-time pass over the entire AST. For
example, in order to build HsExpr GhcPrePs
, we would need to build
HsLocalBinds GhcPrePs
(as part of HsLet
), and we never want HsLocalBinds GhcPrePs
.
The solution that keeps basic definitions (such as HsExpr
) clean and keeps
the concern local to the parser is to parse into a dedicated intermediate
representation, but what representation?
-
Observation I:
Expressions and commands are disjoint. There are no user inputs that could be interpreted as either an expression or a command depending on outer context.
5
is definitely an expression,x -< y
is definitely a command.Even though we have both
HsLam
andHsCmdLam
, we can look at the body to disambiguate:\p -> 5 -- definitely an expression \p -> (x -< y) -- definitely a command
This means we could use a bottom-up flow of information to determine whether we are parsing an expression or a command, using a sum type for intermediate results:
Either (LHsExpr GhcPs) (LHsCmd GhcPs)
-
Observation II:
Bottom-up flow of information leads to poor error messages. Consider
if ... then 5 else (x -< y)
Do we report that
5
is not a valid command or thatx -< y
is not a valid expression? It depends on whether we want the entire node to beHsIf
orHsCmdIf
, and this information is non-local.Therefore, we must use top-down flow of information. We represent both options and let the caller pick the appropriate one:
(Either SDoc (LHsExpr GhcPs), Either SDoc (LHsCmd GhcPs))
This leads to four possibilities:
(Right exp, Left msg) -- definitely an expression (Left msg, Right cmd) -- definitely a command (Right exp, Right cmd) -- either a command or an expression (Left msg1, Left msg2) -- neither a command nor an expression
Per observation I, we do not truly have the possibility of
(Right exp, Right cmd)
, but being able to represent it brings no harm and might come in handy in the future. -
Observation III:
There are effects during parsing besides failure. For example, we log warnings and API annotations. We achieve maximum flexibility by using the
P
monad directly instead ofEither SDoc
:(P (LHsExpr GhcPs), P (LHsCmd GhcPs))
Since we are parsing into this representation, in essence we have two layers of P:
expParser :: P (P (LHsExpr GhcPs), P (LHsCmd GhcPs))
The outer layer corresponds to the initial pass over the data, while the inner layer represents actions that we perform after we have determined whether we are in an expression on a command context.
-
Observation IV:
When parsing
case
alternatives,do
statements, and so on, we do not want to duplicate code for expressions and commands.That is, rather than have
(codeForExprs ... ... ..., codeForCommands ... ... ...)
We would rather write a single polymorphic definition:
codeForBoth
The naive pair-based representation is not sufficient for this, but there's nothing that a dozen of GHC extensions cannot fix. See the definitions of
ExpCmdP
andExpCmdI
- their essence is still a pair of parsers, but we use RankNTypes and GADTs to allow code sharing between pair components.