# The Parser
[Very incomplete. Please extend as you learn more.]
The parser is written using:

- Alex, for lexical analysis. Source file: `compiler/GHC/Parser/Lexer.x`.
- Happy, for the parser itself. Source file: `compiler/GHC/Parser.y`.
- `GHC.Parser.PostProcess` (formerly known as `RdrHsSyn`), for Haskell support functions. Source file: `compiler/GHC/Parser/PostProcess.hs`.
- Read Shayne Fletcher's great blog post about how GHC deals with ambiguous situations and the `DisambECP` class.
## Principles
Making a parser parse precisely the right language is hard, so GHC's parser follows this principle:
- We often parse "over-generously", and filter out the bad cases later.
Here are some examples:
- Patterns are parsed as expressions, and transformed from `GHC.Hs.Expr.HsExpr` into `GHC.Hs.Pat.HsPat` in `GHC.Parser.PostProcess.checkPattern`. An expression like `[x | x <- xs]` that doesn't look like a pattern is rejected by `checkPattern` (a toy sketch of this idea follows the list). (ToDo: Update to the latest information.)
- The context of a type is parsed as a type, and then converted into a context by `GHC.Parser.PostProcess.checkContext`. For example, when parsing `f :: (Read a, Num a) => a -> a` the parser can only discover that `(Read a, Num a)` is a context, rather than a type, when it meets the `=>`; that would require unbounded lookahead. So instead we parse `(Read a, Num a)` as a tuple type, and then convert it to a context when we see the `=>`.
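Here is a toy sketch of that first example, the "parse over-generously, check later" pattern. The types and names are simplified stand-ins invented for illustration, not GHC's real `HsExpr`/`HsPat`:

```haskell
-- Simplified stand-ins for the parser's output and the pattern AST.
data Exp
  = Var String          -- x
  | Con String [Exp]    -- Just x
  | Tup [Exp]           -- (x, y)
  | ListComp Exp [Exp]  -- [ x | x <- xs ]
  deriving Show

data Pat
  = PVar String
  | PCon String [Pat]
  | PTup [Pat]
  deriving Show

-- The left-hand side was parsed as an Exp; now check that it really
-- is a pattern, rejecting expression-only forms.
checkPattern :: Exp -> Either String Pat
checkPattern (Var x)    = Right (PVar x)
checkPattern (Con c es) = PCon c <$> traverse checkPattern es
checkPattern (Tup es)   = PTup <$> traverse checkPattern es
checkPattern ListComp{} = Left "list comprehension is not a valid pattern"

-- checkPattern (Con "Just" [Var "x"])  ~~>  Right (PCon "Just" [PVar "x"])
-- checkPattern (ListComp ...)          ~~>  Left "..."
```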
Sometimes the over-generous parsing is only dealt with by the renamer. For example:
- Infix operators are parsed as if they were all left-associative. The renamer uses the fixity declarations to re-associate the syntax tree (sketched below).
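A toy sketch of that re-association, invented for illustration (GHC's real renamer code additionally reports an error when operators of equal precedence but incompatible associativity are mixed):

```haskell
-- The parser builds every operator application left-associated:
--   1 + 2 * 3   ~~>   OpApp (OpApp (Lit 1) "+" (Lit 2)) "*" (Lit 3)
-- The renamer then rotates the tree using the declared fixities.
data Assoc = LeftA | RightA deriving (Eq, Show)
data E = Lit Int | OpApp E String E deriving Show

-- Precedence and associativity of each operator.
type FixityEnv = String -> (Int, Assoc)

-- Should (a `op1` b) `op2` r be rotated into a `op1` (b `op2` r)?
needsRotate :: (Int, Assoc) -> (Int, Assoc) -> Bool
needsRotate (p1, a1) (p2, a2) =
  p1 < p2 || (p1 == p2 && a1 == RightA && a2 == RightA)

reassoc :: FixityEnv -> E -> E
reassoc env (OpApp l op2 r) =
  case reassoc env l of
    OpApp a op1 b
      | needsRotate (env op1) (env op2)
      -> OpApp a op1 (reassoc env (OpApp b op2 r))
    l' -> OpApp l' op2 (reassoc env r)
reassoc _ e = e

-- With (+) at precedence 6 and (*) at 7, reassoc rebuilds the example
-- above as OpApp (Lit 1) "+" (OpApp (Lit 2) "*" (Lit 3)).
```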
There are plenty more examples. A good feature of this approach is that the checks later in compilation tend to produce much more helpful error messages; errors generated by the parser itself tend to say "Parse error on line X" and not much more.
The main point is this: if you are changing the parser, feel free to make it accept more programs than it does at the moment, provided you also add a later check that rejects the bad programs. Typically you need this flexibility when some new thing you want to add makes the grammar ambiguous, and you need more context to disambiguate. Delicate hacking of the LR grammar is to be discouraged: it's very hard to maintain and debug.
## Avoiding right-recursion
Be sure to read the Happy manual's section on avoiding right recursion. In GHC, the preferred method is a left-recursive rule building an `OrdList`, as below:
```
foolist :: { OrdList Foo }
        : foolist ',' foo       { $1 `appOL` unitOL $3 }
        | foolist ','           { $1 }          -- optional, but recommended
        | foo                   { unitOL $1 }
```
`OrdList` operationally works the same way as building a list in reverse (as in the Happy manual), but it makes it less likely that you'll forget to call `reverse` when you need to get the final list out.
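For reference, a minimal sketch of how the pieces above fit together, assuming the `ghc` package's `GHC.Data.OrdList` module (plain `OrdList` in older source trees):

```haskell
import GHC.Data.OrdList (OrdList, nilOL, unitOL, appOL, fromOL)

-- Build left-to-right, one element at a time, exactly as the
-- left-recursive grammar actions above do.
buildFoos :: [Int] -> OrdList Int
buildFoos = foldl (\acc x -> acc `appOL` unitOL x) nilOL

-- fromOL recovers an ordinary list in the original order;
-- there is no reverse to forget.
foos :: [Int]
foos = fromOL (buildFoos [1, 2, 3])   -- [1,2,3]
```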
One interesting, non-obvious fact is that if you *do* use a right-recursive parser, the "extra semicolons" production should NOT be pluralized: writing `foolist ','` there would make the grammar ambiguous (a trailing separator could then be derived in more than one way):
```
foolist :: { [Foo] }
        : foo ',' foolist       { $1 : $3 }
        | foo ','               { [$1] }        -- NOT foolist
        | foo                   { [$1] }
```
## Indentation
Probably the most complicated interaction between the lexer and parser is with regard to whitespace-sensitive layout. The most important thing to know is that the lexer understands layout, and will output virtual open/close curlies (the productions `vocurly` and `vccurly`) as well as semicolons, which can then be used as part of productions in `Parser.y`. So, for example, if you are writing a rule that will make use of indentation, you should accept both virtual and literal curlies:
```
body :: { ([AddAnn], ([LImportDecl RdrName], [LHsDecl RdrName])) }
        : '{' top '}'           { (moc $1 : mcc $3 : fst $2, snd $2) }
        | vocurly top close     { (fst $2, snd $2) }
```
Notice the use of `close` rather than `vccurly`: `close` is a production that accepts both `vccurly` and a Happy `error`; that is, if we encounter an error in parsing, we try exiting an indentation context and trying again. This ensures, for example, that the top-level context can be closed even if no virtual curly was output.
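Concretely, the `close` production looks roughly like this (paraphrased from `Parser.y`; check the current source for the exact actions):

```
close :: { () }
        : vccurly               { () }          -- context popped in lexer
        | error                 {% popContext }
```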
The top level of a Haskell file does not automatically have a layout context; when there is no `module` keyword, a context is implicitly pushed using `missing_module_keyword`.
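That production is roughly the following (again paraphrased; the exact action has varied across GHC versions):

```
missing_module_keyword :: { () }
        : {- empty -}           {% pushModuleContext }
```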
When writing grammars that accept semicolon-separated sequences, be sure to include a rule allowing for trailing semicolons (see the previous section); otherwise, you will reject programs that use layout.
## Syntax extensions
Many syntactic features must be enabled with a `LANGUAGE` flag, since they could cause existing Haskell programs to stop compiling, e.g. by turning some identifiers into keywords. We effect this change of behaviour primarily in the lexer, by turning certain tokens on and off. This is done using predicates, which let Alex enable or disable token rules depending on which extensions are enabled:
```
<0> {
  "[|" / { ifExtension thEnabled }      { token ITopenExpQuote }
}
```
To add a new syntax extension, add a constructor to `ExtBits` and set the bit appropriately in `mkPState`.
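A self-contained toy sketch of how such a bitmap-plus-predicate scheme works; the names `ExtBits`, `mkPState`, and `thEnabled` come from the text above, while everything else here is invented for illustration:

```haskell
import Data.Bits (setBit, testBit)
import Data.Word (Word64)

-- One constructor per syntax extension (simplified; see ExtBits in
-- GHC.Parser.Lexer for the real list).
data ExtBits = ThBit | MagicHashBit | UnicodeSyntaxBit
  deriving (Enum, Show)

type ExtsBitmap = Word64

-- Set a bit when the corresponding LANGUAGE extension is on; mkPState
-- would do this once for every ExtBits constructor.
setBitIf :: ExtBits -> Bool -> ExtsBitmap -> ExtsBitmap
setBitIf b True  bm = bm `setBit` fromEnum b
setBitIf _ False bm = bm

-- The predicate an Alex rule consults, as in the lexer snippet above.
thEnabled :: ExtsBitmap -> Bool
thEnabled bm = bm `testBit` fromEnum ThBit
```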
## Tips for development
If you're working on the lexer, it can be useful to have a function `lex :: String -> [Token]`. Copying the following into GHCi will do the trick, though it doesn't do error handling:
```
:{
lex str = case Lexer.lexTokenStream
                 (StringBuffer.stringToStringBuffer str)
                 (SrcLoc.mkRealSrcLoc (FastString.fsLit "") 0 0)
                 DynFlags.unsafeGlobalDynFlags of
            Lexer.POk _ a -> Prelude.map (\(SrcLoc.L _ e) -> e) a
:}
```
It is also possible to get a function that takes a list of active extensions:
```
:{
lex exts str =
  case Lexer.lexTokenStream
         (StringBuffer.stringToStringBuffer str)
         (SrcLoc.mkRealSrcLoc (FastString.fsLit "") 0 0)
         (Data.List.foldl' DynFlags.xopt_set DynFlags.unsafeGlobalDynFlags exts) of
    Lexer.POk _ a -> Prelude.map (\(SrcLoc.L _ e) -> e) a
:}
```
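Hypothetical usage, assuming the era of the API shown above, where extensions are `Opt_*` constructors exported by `DynFlags`:

```
lex [DynFlags.Opt_MagicHash] "f x# = x#"
```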