|
|
|
---
|
|
|
|
Title: GHC Tolerant Parser
|
|
|
|
Subtitle: 'GHC Tolerant Parser: Recovery and Representation'
|
|
|
|
Project: Parse error recovery and incrementality for GHC
|
|
|
|
Author: Karim Taha
|
|
|
|
Affiliation: GSoC 2024, summer.haskell.org
|
|
|
|
Date: Nov 1, 2024
|
|
|
|
---
|
|
|
|
|
|
|
|
# GHC Tolerant Parser: Recovery and Representation
|
|
|
|
|
|
|
|
## Introduction
|
|
|
|
|
|
|
|
Let's assume you are writing a haskell program you forgot to put the closing ')'
|
|
|
|
of a tuple, had the intention to declare a type and forgot the type name, and
|
|
|
|
forgot the expression to match on in a case statment, something like the
|
|
|
|
following,
|
|
|
|
|
|
|
|
```haskell
|
|
|
|
-- file Main.hs
|
|
|
|
module Main where
|
|
|
|
data
|
|
|
|
f :: Show a => a ->
|
|
|
|
f a = case of () -> (a, a
|
|
|
|
main = print $ f ()
|
|
|
|
```
|
|
|
|
|
|
|
|
If you try to compile this code, and you should, you would be struck by the
|
|
|
|
following errors, actualy one error,
|
|
|
|
|
|
|
|
```shell
|
|
|
|
$ ghc Main.hs
|
|
|
|
> [1 of 1] Compiling Main ( Main.hs, Main.o )
|
|
|
|
>
|
|
|
|
> Main.hs:3:1: error:
|
|
|
|
> parse error (possibly incorrect indentation or mismatched brackets)
|
|
|
|
> |
|
|
|
|
> 3 | f :: Show a=> a ->
|
|
|
|
> | ^
|
|
|
|
```
|
|
|
|
|
|
|
|
If you are used to ghc, you might think, **so what**? Well, this is clearly a
|
|
|
|
parsing error, but there are several parsing errors as well, the missing return
|
|
|
|
type for `f` and the missing expression in the `case` statement, all of which
|
|
|
|
got ignored by the compiler, which reported the first one and returned with exit
|
|
|
|
code 1.
|
|
|
|
|
|
|
|
The parser errors because after `data` an `identifier` is expected not a `;`
|
|
|
|
(which got inserted by the lexer, according to the layout rule, just before
|
|
|
|
`f`), thus upon seeing `;`, it fails to produce output and hence the compilation
|
|
|
|
pipeline fails.
|
|
|
|
|
|
|
|
Furthermore, the error message itself says no information, it only reports the
|
|
|
|
location of the error, with a wrong and irrelevant diagnostic, it should at
|
|
|
|
least mention any thing like 'expected and identifier'.
|
|
|
|
|
|
|
|
This in-tolerant parser makes it frustrating to do simple edit-compile-test
|
|
|
|
cycles, in the scenario above, the user would have to issue the compilation
|
|
|
|
command three times in order to catch all the parsing errors.
|
|
|
|
|
|
|
|
This becomes more apparent with source analysis tools, running the above example
|
|
|
|
through `hls`, we have
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
All the code actions stopped, only an underline at the end of the file with a
|
|
|
|
useless diagnostic.
|
|
|
|
|
|
|
|
## A more tolerant/informative Parser
|
|
|
|
|
|
|
|
To have better error messages, and more comprehensive error reporting (where
|
|
|
|
errors from all the pipeline stages are reported in one go), the parser has to
|
|
|
|
output a **partial ast** -- when it sees something like `tup = (1, 2`, it can
|
|
|
|
*recover* from the missing closing `)` and continue parsing instead of panicking
|
|
|
|
out. The resultant **partial ast** is then passed to rest of the compiler
|
|
|
|
pipeline.
|
|
|
|
|
|
|
|
**Tolerant Parsers** require an error recovery mechanism and ann error
|
|
|
|
representation in the output ast.
|
|
|
|
|
|
|
|
|
|
|
|
### Happy's new error recovery mechanisms
|
|
|
|
Happy 2.1 comes with new features, namely the error recovery mechanism. It uses
|
|
|
|
a synchronizatin token, the `catch` token. When the parser encounters an
|
|
|
|
unexpected token, it inserts a `catch` token which tells the parser where to
|
|
|
|
resume parsing.
|
|
|
|
|
|
|
|
The grammar author has to handle `catch` tokens, otherwise, it's a syntax
|
|
|
|
error. This is called **error resumption mode**.
|
|
|
|
|
|
|
|
`catch` mechansim requires two functions to be specified, the first is called
|
|
|
|
when it recovers from error, the `report` continuation, otherwise, it calls the
|
|
|
|
`abort` function.
|
|
|
|
|
|
|
|
Upon encountring an unexpected token, the parser collects prefixes of the state
|
|
|
|
stack that can shift `catch` and shifts it on every stack, each new stack,
|
|
|
|
`prefix + shifted catch`, is called a catch-frame.
|
|
|
|
|
|
|
|
Then starting from the longest catch-frame, the innermost one, discard tokens
|
|
|
|
from the input until a token can be shifted, otherwise try the next catch-frame,
|
|
|
|
if the end-of-input is reached and no catch-frame resumed parsing, call `abort`.
|
|
|
|
|
|
|
|
For more information on how to specify `report` and `abort` functions, and how to
|
|
|
|
use `catch`, see Happy's [documentation](https://haskell-happy.readthedocs.io/en/latest/using.html#resumptive-parsing-with-catch).
|
|
|
|
|
|
|
|
### GHC's new parser
|
|
|
|
Now that Happy got the new error recovery mechanism, GHC's parser has to
|
|
|
|
incorporate the usage of `catch`. we define one catch non-terminal for each
|
|
|
|
syntax category, `catch_HsDecl`, `catch_HsType`, `catch_HsExpr`, note that we
|
|
|
|
didn't define one for patterns because they are constructed from expressions.
|
|
|
|
|
|
|
|
The next step is to determine what actions correspond to these non-terminals.
|
|
|
|
Since the result of a parser is an ast, we are expected to construct nodes that
|
|
|
|
represent parsing errors, they are generated by `catch` in the end.
|
|
|
|
|
|
|
|
To represent errors in the syntax tree, we tried not to add new constructors to
|
|
|
|
L.H.S, Since error nodes are specific to the compiler, instead, for each syntax
|
|
|
|
category, we used what correponded to a `hole`, for example,`HsUnboundVar`
|
|
|
|
represent bad expressions.
|
|
|
|
|
|
|
|
The error representation is distinctive, it can only appear for an error node,
|
|
|
|
this is important, since we use the same constructors we use for correct syntax,
|
|
|
|
it ensures no false positives.
|
|
|
|
|
|
|
|
Next step, and most important one, where to insert `catch_*` in the grammar?
|
|
|
|
Well, we insert `catch` where we want the parser to resume parsing in case of
|
|
|
|
error, so for example, taking a look at `parser.y`, we see
|
|
|
|
|
|
|
|
```haskell
|
|
|
|
|
|
|
|
catch_loc :: { SrcSpan }
|
|
|
|
: catch {% do { ps <- getPrevLoc
|
|
|
|
; return (mkSrcSpanPs ps) } }
|
|
|
|
|
|
|
|
catch_missing_tok(TOK) :: { Located Token }
|
|
|
|
: TOK { $1 }
|
|
|
|
| catch_loc { L $1 undefined }
|
|
|
|
|
|
|
|
aexp2 :: { ECP }
|
|
|
|
...
|
|
|
|
...
|
|
|
|
| '(' tup_exprs catch_missing_tok(')') { ECP $
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
We want to resume parsing we miss a closing `)` at the end of a tuple, so we
|
|
|
|
change the last symbol to a union between `catch` and `)`, done through Happy's
|
|
|
|
parametric rules.
|
|
|
|
|
|
|
|
This seems easy, whenever we encounter a parse error, we trace Happy's state
|
|
|
|
machine (outputable with parser built with `-d` flag passed to `happy` command),
|
|
|
|
then on the state that rejects the unexpected token, we add a `catch` token in
|
|
|
|
the corresponding rule in grammar, right? at least this how I did it.
|
|
|
|
|
|
|
|
It's not that easy, since `catch` is always shifted, and it can't appear as a
|
|
|
|
lookahead token, we can't get a shift/reduce conflict, however, we can easily
|
|
|
|
get a reduce/reduce conflict, for example, try to add a new `catch_HsExpr`
|
|
|
|
alternative for `infixexp` rule and you will get a reduce/reduce conflict.
|
|
|
|
|
|
|
|
The conflict is easy to reason about, since `infixexp`, through left
|
|
|
|
expansion/substitution, reaches a state where `aexp2` is the leftmost symbol in
|
|
|
|
the production, and since `aexp2` itself has a `catch_HsExpr` alternative, this
|
|
|
|
means there will be a state with the following itemset,
|
|
|
|
|
|
|
|
``` text
|
|
|
|
state a:
|
|
|
|
infixexp -> . aexp2 <the rest>
|
|
|
|
infixexp -> . catch_HsExpr
|
|
|
|
aexp2 -> . catch_HsExpr
|
|
|
|
|
|
|
|
-- assume it reduced a catch_HsExpr
|
|
|
|
|
|
|
|
state b:
|
|
|
|
infixexp -> catch_HsExpr .
|
|
|
|
aexp2 -> catch_HsExpr .
|
|
|
|
```
|
|
|
|
That's a reduce/reduce conflict.
|
|
|
|
|
|
|
|
We prefer `aexp2` over `infixexp` since it's the inner most production reachable
|
|
|
|
from infixexp through expansion of the leftmost symbol, so it's occurence would
|
|
|
|
be lower in the tree than it would if it were an `infixexp` alternative. This
|
|
|
|
ensures more correct info is kept.
|
|
|
|
|
|
|
|
How to insert `catch` while avoiding reduce/reduce conflicts? After a leftmost
|
|
|
|
expansion, no two rules, where one is reachable from the other, should have a
|
|
|
|
common prefix ending with a `catch`. This is not necessarily exhaustive, it's
|
|
|
|
the heuristic used in GHC's parser.
|
|
|
|
|
|
|
|
The only exception is for the following productions,
|
|
|
|
``` haskell
|
|
|
|
catch_HsDecl :: { LHsDecl GhcPs }
|
|
|
|
: catch_loc
|
|
|
|
|
|
|
|
catch_HsType :: { LHsType GhcPs }
|
|
|
|
: catch_loc
|
|
|
|
|
|
|
|
catch_HsExpr :: { ECP }
|
|
|
|
: catch_loc %shift
|
|
|
|
|
|
|
|
topdecl :: { LHsDecl GhcPs }
|
|
|
|
...
|
|
|
|
...
|
|
|
|
| infixexp
|
|
|
|
| catch_HsDecl
|
|
|
|
|
|
|
|
infixexp :: { ECP }
|
|
|
|
...
|
|
|
|
...
|
|
|
|
| catch_HsExpr -- through aexp2
|
|
|
|
```
|
|
|
|
|
|
|
|
Without `%shift` annotation in `catch_HsExpr`, this is a reduce/reduce conflict. The problem
|
|
|
|
is that a `topdecl` can be an `infixexp`, these are template haskell expressions.
|
|
|
|
|
|
|
|
The two offending rules are
|
|
|
|
```haskell
|
|
|
|
decl_no_th :: { LHsDecl GhcPs }
|
|
|
|
...
|
|
|
|
...
|
|
|
|
| infixexp opt_sig rhs -- this rule is responsible for variable binding in class definitions
|
|
|
|
-- infixexp -> aexp2 -> catch_HsExpr
|
|
|
|
|
|
|
|
decl :: { LHsDecl GhcPs }
|
|
|
|
...
|
|
|
|
...
|
|
|
|
| decl_no_th
|
|
|
|
| catch_HsDecl
|
|
|
|
```
|
|
|
|
|
|
|
|
Why make `catch_HsDecl` win?
|
|
|
|
consider the following example,
|
|
|
|
|
|
|
|
``` haskell
|
|
|
|
module Lib where
|
|
|
|
|
|
|
|
class C a where
|
|
|
|
x :: a
|
|
|
|
f :: a -> Int
|
|
|
|
1
|
|
|
|
|
|
|
|
z = 1
|
|
|
|
```
|
|
|
|
|
|
|
|
There is a syntax error in `1`, it gets parsed as a bad declaration, if we had
|
|
|
|
let `catch_HsExpr` win, we would fail to shift `opt_sig` followed by `rhs`,
|
|
|
|
since we can't shift input tokens, we discard states until the next catch frame,
|
|
|
|
this reaches a state corresponding to the following input
|
|
|
|
|
|
|
|
```haskell
|
|
|
|
module Lib where
|
|
|
|
|
|
|
|
class
|
|
|
|
z = 1
|
|
|
|
```
|
|
|
|
|
|
|
|
This can shift `catch` which is reduced to `catch_HsType` resulting in `bad
|
|
|
|
class`,however the class' definition is itself complete, an empty `where`, then
|
|
|
|
we can shift another top level declaratin`z = 1`.
|
|
|
|
|
|
|
|
## Results
|
|
|
|
To test ghc's changes, we will opt for using `-ddump-parsed` flag, since `hls` doesn't
|
|
|
|
support `ghc-master` yet,
|
|
|
|
|
|
|
|
For the program in the introduction,
|
|
|
|
|
|
|
|
<details>
|
|
|
|
<summary> `ghc-9.10.1` output </summary>
|
|
|
|
|
|
|
|
```shell
|
|
|
|
$ ghc Main.hs -ddump-parsed
|
|
|
|
> ghc -ddump-parsed test.hs
|
|
|
|
> [1 of 2] Compiling Main ( test.hs, test.o )
|
|
|
|
>
|
|
|
|
> test.hs:3:1: error: [GHC-58481]
|
|
|
|
> parse error (possibly incorrect indentation or mismatched brackets)
|
|
|
|
> |
|
|
|
|
> 3 | f :: Show a => a ->
|
|
|
|
> | ^
|
|
|
|
```
|
|
|
|
The parser gives no output.
|
|
|
|
</details>
|
|
|
|
|
|
|
|
|
|
|
|
<details>
|
|
|
|
<summary> `ghc-master` output </summary>
|
|
|
|
|
|
|
|
```shell
|
|
|
|
$ _build/stage1/bin/ghc -ddump-parsed test.hs
|
|
|
|
> [1 of 2] Compiling Main ( test.hs, test.o )
|
|
|
|
>
|
|
|
|
> ==================== Parser ====================
|
|
|
|
> module Main where
|
|
|
|
> data <Bad Type>
|
|
|
|
> f :: Show a => a -> <Bad Type>
|
|
|
|
> f a = case <Bad Expression> of () -> (a, a)
|
|
|
|
> main = print $ f ()
|
|
|
|
>
|
|
|
|
>
|
|
|
|
> test.hs:3:18: error: [GHC-88464]
|
|
|
|
> • Found type wildcard ‘_’ standing for ‘(a, a)’
|
|
|
|
> Where: ‘a’ is a rigid type variable bound by
|
|
|
|
> the inferred type of f :: Show a => a -> (a, a)
|
|
|
|
> at test.hs:3:1-19
|
|
|
|
> To use the inferred type, enable PartialTypeSignatures
|
|
|
|
> • In the type signature: f :: Show a => a -> <Bad Type>
|
|
|
|
> |
|
|
|
|
> 3 | f :: Show a => a ->
|
|
|
|
> | ^^
|
|
|
|
>
|
|
|
|
> test.hs:4:12: error: [GHC-88464]
|
|
|
|
> ill-formed expression <Bad Expression> :: ()
|
|
|
|
> |
|
|
|
|
> 4 | f a = case of () -> (a, a
|
|
|
|
> | ^^
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
We can see the `hole` thingy, mentioned above in relation to error representation, in the
|
|
|
|
type error reported for `f :: Show a => a -> <Bad Type>`.
|
|
|
|
|
|
|
|
</details>
|
|
|
|
|
|
|
|
-------------------------------------------------------------------------------
|
|
|
|
|
|
|
|
<details>
|
|
|
|
<summary> Another example </summary>
|
|
|
|
|
|
|
|
```haskell
|
|
|
|
module Lib where
|
|
|
|
|
|
|
|
f (let x = 1 in y) = 10
|
|
|
|
f (case () of _ -> 10) = 10
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
data asd
|
|
|
|
data -
|
|
|
|
i = + parseErrorHere
|
|
|
|
|
|
|
|
data C = data (Int, data)
|
|
|
|
| in Int
|
|
|
|
|
|
|
|
class (data, Show a) => data a where
|
|
|
|
x:
|
|
|
|
|
|
|
|
case
|
|
|
|
y = let case
|
|
|
|
arith = 1 +
|
|
|
|
ls = [data ..]
|
|
|
|
tup = (0, data, 1, data, 2, data)
|
|
|
|
lc = [data | in]
|
|
|
|
|
|
|
|
|
|
|
|
localdef =
|
|
|
|
where {
|
|
|
|
; = 2
|
|
|
|
; x = 1
|
|
|
|
}
|
|
|
|
|
|
|
|
```
|
|
|
|
```shell
|
|
|
|
$ _build/stage1/bin/ghc -V && _build/stage1/bin/ghc -ddump-parsed test.hs
|
|
|
|
> The Glorious Glasgow Haskell Compilation System, version 9.13.20241026
|
|
|
|
> [1 of 1] Compiling Lib ( test.hs, test.o )
|
|
|
|
>
|
|
|
|
> ==================== Parser ====================
|
|
|
|
> module Lib where
|
|
|
|
> f (<Bad Pattern>) = 10
|
|
|
|
> f (<Bad Pattern>) = 10
|
|
|
|
> data asd
|
|
|
|
> data <Bad Type>
|
|
|
|
> i = <Bad Expression> + parseErrorHere
|
|
|
|
> data C = <Bad DataCon> (Int, <Bad Type>) | <Bad DataCon> Int
|
|
|
|
> class (<Bad Type>, Show a) => <Bad Class> a where
|
|
|
|
> <Bad Pattern> = <Bad Expression>
|
|
|
|
> <Bad Pattern> = <Bad Expression>
|
|
|
|
> y = <Bad Expression>
|
|
|
|
> arith = 1 + <Bad Expression>
|
|
|
|
> ls = [<Bad Expression> .. ]
|
|
|
|
> tup
|
|
|
|
> = (0, <Bad Expression>, 1, <Bad Expression>, 2, <Bad Expression>)
|
|
|
|
> lc = [<Bad Expression> | <Bad Expression>]
|
|
|
|
> localdef
|
|
|
|
> = <Bad Expression>
|
|
|
|
> where
|
|
|
|
> <Bad Pattern> = 2
|
|
|
|
> x = 1
|
|
|
|
>
|
|
|
|
>
|
|
|
|
> test.hs:13:13: error: [GHC-29916]
|
|
|
|
> Multiple declarations of ‘<Bad DataCon>’
|
|
|
|
> Declared at: test.hs:12:15
|
|
|
|
> test.hs:13:13
|
|
|
|
> |
|
|
|
|
> 13 | | in Int
|
|
|
|
> | ^^^
|
|
|
|
>
|
|
|
|
```
|
|
|
|
</details>
|
|
|
|
|
|
|
|
## Shortcomings
|
|
|
|
The most visible one is the interaction between the `error` token, needed by the layout rule, and `catch`, take for example
|
|
|
|
```haskell
|
|
|
|
module Lib where
|
|
|
|
|
|
|
|
x = f data
|
|
|
|
```
|
|
|
|
It's expected that this would be parsed as
|
|
|
|
``` haskell
|
|
|
|
module Lib where
|
|
|
|
x = f <Bad Expression>
|
|
|
|
```
|
|
|
|
|
|
|
|
However, if we wrote the equivalent explicitly layouted code, we get
|
|
|
|
```haskell
|
|
|
|
module Lib where {
|
|
|
|
x = f
|
|
|
|
}
|
|
|
|
data
|
|
|
|
```
|
|
|
|
|
|
|
|
When the parser sees `data`, which is an unexpected token, it shifts an `error`
|
|
|
|
token, before `catch` (it actually tries `error` before `catch` in every rule,
|
|
|
|
the difference is that the current state of the parser can handle `error`),
|
|
|
|
which gets reduced to `vccurly`,
|
|
|
|
|
|
|
|
```shell
|
|
|
|
$ _build/stage1/bin/ghc -ddump-parsed test.hs
|
|
|
|
> [1 of 1] Compiling Lib ( test.hs, test.o )
|
|
|
|
>
|
|
|
|
> ==================== Parser ====================
|
|
|
|
> module Lib where
|
|
|
|
> x = f
|
|
|
|
>
|
|
|
|
>
|
|
|
|
> test.hs:3:5: error: [GHC-88464] Variable not in scope: f
|
|
|
|
> |
|
|
|
|
> 3 | x = f data
|
|
|
|
> | ^
|
|
|
|
```
|
|
|
|
|
|
|
|
It ignored code outside the curly brances, due to the rule `module ->
|
|
|
|
module_no_catch catch`, and then reported an unbound variable error.
|
|
|
|
|
|
|
|
Other limitations include cases where there is no `catch` yet, things like
|
|
|
|
import declarations and export list. This is work in progress. We started with
|
|
|
|
basic syntactical categories first.
|
|
|
|
|
|
|
|
## Future Work
|
|
|
|
- Our todo list has:
|
|
|
|
- add missing `catch` rules.
|
|
|
|
- add error representations for `HsCmd` node and several other nodes.
|
|
|
|
- eliminate all usages of `addFatalError` from PostProcess.hs.
|
|
|
|
- write better error messages for cases when the `hole` related message
|
|
|
|
makes no sense.
|
|
|
|
- improve the `exact-printing` annotations, to print the user input, even
|
|
|
|
when it has syntax errors, along with the error representation, this is
|
|
|
|
necessary for pretty printers, where input length affects the output.
|
|
|
|
- Improve error reporting, the current implementation hides parsing errors, when
|
|
|
|
reporting errors from later stages.
|
|
|
|
|
|
|
|
|
|
|
|
## Our work and current progress
|
|
|
|
- Happy: (mostly work by Sebastian Graf, my contribution here was refactoring and preparing patches)
|
|
|
|
- <https://github.com/haskell/happy/commit/68ffbeeb607c497dfd0603c7ce2f3cff1a1895c8>
|
|
|
|
- <https://github.com/haskell/happy/commit/80126af45fb7fa11b9a21cdee4a93145dacfe8af>
|
|
|
|
- <https://github.com/haskell/happy/commit/fe480e3d905237ff7e87ac862425984c3745374c>
|
|
|
|
- <https://github.com/haskell/happy/commit/ed633c4ce2fa9fa0fdd6ac66e0c7a55aa1e03d59>
|
|
|
|
- <https://github.com/haskell/happy/commit/d9b9980767527ac7f68c88d2d764f9a4b02504b4>
|
|
|
|
- GHC:
|
|
|
|
- <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/13145> |