Skip to content

Expression/command ambiguity resolution

Vladislav Zavialov requested to merge wip/exp-cmd-frame into master

Expression/command ambiguity (arrow syntax)

There are places in the grammar where we do not know whether we are parsing an expression or a command without infinite lookahead:

  Compare:  proc x -> do { (stuff) -< x }   -- 'stuff' is an expression
            proc x -> do { (stuff) }        -- 'stuff' is a command

Until we encounter arrow syntax -< we don't know whether to parse (stuff) as an expression or a command.

The approach GHC used before was to parse commands as expressions and rejig later. This is suboptimal because we had to extend HsExpr with command-specific constructs: HsArrForm and HsArrApp, polluting one of the basic definitions.

A less invasive option is to add these constructors only in the GhcPs pass. However, GhcPs corresponds to the output of parsing, not to its intermediate results, so we wouldn't want them there either.

Creating a new GhcPrePs pass would significantly bloat conversion code and slow down the compiler by adding another linear-time pass over the entire AST. For example, in order to build HsExpr GhcPrePs, we would need to build HsLocalBinds GhcPrePs (as part of HsLet), and we never want HsLocalBinds GhcPrePs.

The solution that keeps basic definitions (such as HsExpr) clean and keeps the concern local to the parser is to parse into a dedicated intermediate representation, but what representation?

  • Observation I:

    Expressions and commands are disjoint. There are no user inputs that could be interpreted as either an expression or a command depending on outer context. 5 is definitely an expression, x -< y is definitely a command.

    Even though we have both HsLam and HsCmdLam, we can look at the body to disambiguate:

      \p -> 5          -- definitely an expression
      \p -> (x -< y)   -- definitely a command

    This means we could use a bottom-up flow of information to determine whether we are parsing an expression or a command, using a sum type for intermediate results:

      Either (LHsExpr GhcPs) (LHsCmd GhcPs)
  • Observation II:

    Bottom-up flow of information leads to poor error messages. Consider

      if ... then 5 else (x -< y)

    Do we report that 5 is not a valid command or that x -< y is not a valid expression? It depends on whether we want the entire node to be HsIf or HsCmdIf, and this information is non-local.

    Therefore, we must use top-down flow of information. We represent both options and let the caller pick the appropriate one:

      (Either SDoc (LHsExpr GhcPs), Either SDoc (LHsCmd GhcPs))

    This leads to four possibilities:

      (Right exp, Left msg)     -- definitely an expression
      (Left msg, Right cmd)     -- definitely a command
      (Right exp, Right cmd)    -- either a command or an expression
      (Left msg1, Left msg2)    -- neither a command nor an expression

    Per observation I, we do not truly have the possibility of (Right exp, Right cmd), but being able to represent it brings no harm and might come in handy in the future.

  • Observation III:

    There are effects during parsing besides failure. For example, we log warnings and API annotations. We achieve maximum flexibility by using the P monad directly instead of Either SDoc:

      (P (LHsExpr GhcPs), P (LHsCmd GhcPs))

    Since we are parsing into this representation, in essence we have two layers of P:

      expParser :: P (P (LHsExpr GhcPs), P (LHsCmd GhcPs))

    The outer layer corresponds to the initial pass over the data, while the inner layer represents actions that we perform after we have determined whether we are in an expression on a command context.

  • Observation IV:

    When parsing case alternatives, do statements, and so on, we do not want to duplicate code for expressions and commands.

    That is, rather than have

      (codeForExprs ...  ...  ...,
       codeForCommands ...  ...  ...)

    We would rather write a single polymorphic definition:

      codeForBoth

    The naive pair-based representation is not sufficient for this, but there's nothing that a dozen of GHC extensions cannot fix. See the definitions of ExpCmdP and ExpCmdI - their essence is still a pair of parsers, but we use RankNTypes and GADTs to allow code sharing between pair components.

Edited by Vladislav Zavialov

Merge request reports