Changes

Karim Taha · fa5b55c9
--- a/GHC-Tolerant-Parser.md
+++ b/GHC-Tolerant-Parser.md
+---
+Title:       GHC Tolerant Parser
+Subtitle:    'GHC Tolerant Parser: Recovery and Representation'
+Project:     Parse error recovery and incrementality for GHC
+Author:      Karim Taha
+Affiliation: GSoC 2024, summer.haskell.org
+Date:        Nov 1, 2024
+---
+
+# GHC Tolerant Parser: Recovery and Representation
+
+## Introduction
+
+Let's assume you are writing a haskell program you forgot to put the closing ')'
+of a tuple,  had the intention to declare  a type and forgot the  type name, and
+forgot the  expression to  match on in  a case statment, something  like the
+following,
+
+```haskell
+-- file Main.hs
+module Main where
+data
+f :: Show a => a ->
+f a = case of () -> (a, a
+main = print $ f ()
+```
+
+If you try to compile this code, and you should, you would be struck by the
+following errors, actualy one error,
+
+```shell
+$ ghc Main.hs
+> [1 of 1] Compiling Main             ( Main.hs, Main.o )
+>
+> Main.hs:3:1: error:
+>     parse error (possibly incorrect indentation or mismatched brackets)
+>   |
+> 3 | f :: Show a=> a ->
+>   | ^
+```
+
+If you are  used to ghc, you might  think, **so what**? Well, this  is clearly a
+parsing error, but there are several  parsing errors as well, the missing return
+type for `f`  and the missing expression  in the `case` statement,  all of which
+got ignored by the compiler, which reported the first one and returned with exit
+code 1.
+
+The parser  errors because after  `data` an `identifier`  is expected not  a `;`
+(which got  inserted by  the lexer,  according to the  layout rule,  just before
+`f`), thus upon seeing  `;`, it fails to produce output  and hence the compilation
+pipeline fails.
+
+Furthermore, the error  message itself says no information, it  only reports the
+location of  the error,  with a  wrong and irrelevant  diagnostic, it  should at
+least mention any thing like 'expected and identifier'.
+
+This  in-tolerant parser  makes it  frustrating to  do simple  edit-compile-test
+cycles, in  the scenario  above, the  user would have  to issue  the compilation
+command three times in order to catch  all the parsing errors.
+
+This becomes more  apparent with source analysis tools, running the above example
+through `hls`, we have
+
+![GHC 9,10.1/HLS 2.9.0.1](hls-original.gif)
+
+All the code  actions stopped, only an underline  at the end of the  file with a
+useless diagnostic.
+
+## A more tolerant/informative Parser
+
+To have  better error  messages, and more  comprehensive error  reporting (where
+errors from all the  pipeline stages are reported in one go),  the parser has to
+output a  **partial ast** -- when  it sees  something like `tup  = (1, 2`,  it can
+*recover* from the missing closing `)` and continue parsing instead of panicking
+out.  The resultant  **partial ast**  is  then passed  to rest  of the  compiler
+pipeline.
+
+**Tolerant  Parsers**  require  an  error   recovery  mechanism  and  ann  error
+representation in the output ast.
+
+
+### Happy's new error recovery mechanisms
+Happy 2.1 comes with new features,  namely the error recovery mechanism. It uses
+a  synchronizatin  token, the  `catch`  token.  When  the parser  encounters  an
+unexpected token,  it inserts a  `catch` token which  tells the parser  where to
+resume parsing.
+
+The  grammar author  has  to handle  `catch` tokens,  otherwise,  it's a  syntax
+error. This is called **error resumption mode**.
+
+`catch` mechansim  requires two functions to  be specified, the first  is called
+when it recovers from error, the  `report` continuation, otherwise, it calls the
+`abort` function.
+
+Upon encountring an unexpected token, the  parser collects prefixes of the state
+stack that  can shift  `catch` and  shifts it  on every  stack, each  new stack,
+`prefix + shifted catch`, is called a catch-frame.
+
+Then starting  from the longest  catch-frame, the innermost one,  discard tokens
+from the input until a token can be shifted, otherwise try the next catch-frame,
+if the end-of-input is reached and no catch-frame resumed parsing, call `abort`.
+
+For more information on how to specify `report` and `abort` functions, and how to
+use `catch`, see Happy's [documentation](https://haskell-happy.readthedocs.io/en/latest/using.html#resumptive-parsing-with-catch).
+
+### GHC's new parser
+Now  that Happy  got  the new  error  recovery mechanism,  GHC's  parser has  to
+incorporate the  usage of  `catch`. we  define one  catch non-terminal  for each
+syntax category,  `catch_HsDecl`, `catch_HsType`,  `catch_HsExpr`, note  that we
+didn't define one for patterns because they are constructed from expressions.
+
+The next  step is to determine  what actions correspond to  these non-terminals.
+Since the result of a parser is an  ast, we are expected to construct nodes that
+represent parsing errors, they are generated by `catch` in the end.
+
+To represent errors in the syntax tree, we tried not to add new constructors to
+L.H.S, Since error nodes are specific to the compiler, instead, for each syntax
+category, we used what correponded to a `hole`, for example,`HsUnboundVar`
+represent bad expressions.
+
+The error representation  is distinctive, it can only appear  for an error node,
+this is important, since we use the same constructors we use for correct syntax,
+it ensures no false positives.
+
+Next step, and most important one, where to insert `catch_*` in the grammar?
+Well, we insert `catch` where we want the parser to resume parsing in case of
+error, so for example, taking a look at `parser.y`, we see
+
+```haskell
+
+catch_loc :: { SrcSpan }
+         : catch  {% do { ps <- getPrevLoc
+                        ; return (mkSrcSpanPs ps) } }
+
+catch_missing_tok(TOK) :: { Located Token }
+        : TOK { $1 }
+        | catch_loc { L $1 undefined }
+
+aexp2   :: { ECP }
+        ...
+        ...
+        | '(' tup_exprs catch_missing_tok(')')             { ECP $
+
+```
+
+We want to  resume parsing we miss  a closing `)` at  the end of a  tuple, so we
+change the last symbol to a union  between `catch` and `)`, done through Happy's
+parametric rules.
+
+This seems  easy, whenever we  encounter a parse  error, we trace  Happy's state
+machine (outputable with parser built with `-d` flag passed to `happy` command),
+then on the state  that rejects the unexpected token, we add  a `catch` token in
+the corresponding rule in grammar, right? at least this how I did it.
+
+It's not that  easy, since `catch` is  always shifted, and it can't  appear as a
+lookahead token,  we can't get a  shift/reduce conflict, however, we  can easily
+get  a reduce/reduce  conflict, for  example, try  to add  a new  `catch_HsExpr`
+alternative for `infixexp` rule and you will get a reduce/reduce conflict.
+
+The  conflict  is   easy  to  reason  about,  since   `infixexp`,  through  left
+expansion/substitution, reaches a state where  `aexp2` is the leftmost symbol in
+the production, and since `aexp2`  itself has a `catch_HsExpr` alternative, this
+means there will be a state with the following itemset,
+
+``` text
+state a:
+     infixexp -> . aexp2 <the rest>
+     infixexp -> . catch_HsExpr
+     aexp2    -> . catch_HsExpr
+
+-- assume it reduced a catch_HsExpr
+
+state b:
+     infixexp -> catch_HsExpr .
+     aexp2    -> catch_HsExpr .
+```
+That's a reduce/reduce conflict.
+
+We prefer `aexp2` over `infixexp` since it's the inner most production reachable
+from infixexp through expansion of the leftmost symbol, so it's occurence would
+be lower in the tree than it would if it were an `infixexp` alternative. This
+ensures more correct info is kept.
+
+How to insert `catch` while  avoiding reduce/reduce conflicts?  After a leftmost
+expansion, no two  rules, where one is  reachable from the other,  should have a
+common prefix ending  with a `catch`.  This is not  necessarily exhaustive, it's
+the heuristic used in GHC's parser.
+
+The only exception is for the following productions,
+``` haskell
+catch_HsDecl :: { LHsDecl GhcPs }
+             : catch_loc
+
+catch_HsType :: { LHsType GhcPs }
+             : catch_loc
+
+catch_HsExpr :: { ECP }
+             : catch_loc %shift
+
+topdecl :: { LHsDecl GhcPs }
+        ...
+        ...
+        | infixexp
+        | catch_HsDecl
+
+infixexp :: { ECP }
+         ...
+         ...
+         | catch_HsExpr -- through aexp2
+```
+
+Without `%shift` annotation in `catch_HsExpr`, this is a reduce/reduce conflict. The problem
+is that a `topdecl` can be an `infixexp`, these are template haskell expressions.
+
+The two offending rules are
+```haskell
+decl_no_th :: { LHsDecl GhcPs }
+           ...
+           ...
+           | infixexp opt_sig rhs  -- this rule is responsible for variable binding in class definitions
+                                   -- infixexp -> aexp2 -> catch_HsExpr
+
+decl    :: { LHsDecl GhcPs }
+        ...
+        ...
+        | decl_no_th
+        | catch_HsDecl
+```
+
+Why make `catch_HsDecl` win?
+consider the following example,
+
+``` haskell
+module Lib where
+
+class C a where
+  x :: a
+  f :: a -> Int
+  1
+
+z = 1
+```
+
+There is a syntax  error in `1`, it gets parsed as a  bad declaration, if we had
+let `catch_HsExpr`  win, we  would fail  to shift  `opt_sig` followed  by `rhs`,
+since we can't shift input tokens, we discard states until the next catch frame,
+this reaches a state corresponding to the following input
+
+```haskell
+module Lib where
+
+class
+z = 1
+```
+
+This can  shift `catch`  which is  reduced to  `catch_HsType` resulting  in `bad
+class`,however the class' definition is  itself complete, an empty `where`, then
+we can shift another top level declaratin`z = 1`.
+
+## Results
+To test ghc's changes, we will opt for using `-ddump-parsed` flag, since `hls` doesn't
+support `ghc-master` yet,
+
+For the program in the introduction,
+
+<details>
+<summary> `ghc-9.10.1` output </summary>
+
+```shell
+$ ghc Main.hs -ddump-parsed
+> ghc -ddump-parsed test.hs
+> [1 of 2] Compiling Main             ( test.hs, test.o )
+>
+> test.hs:3:1: error: [GHC-58481]
+>     parse error (possibly incorrect indentation or mismatched brackets)
+>   |
+> 3 | f :: Show a => a ->
+>   | ^
+```
+The parser gives no output.
+</details>
+
+
+<details>
+<summary> `ghc-master` output </summary>
+
+```shell
+$ _build/stage1/bin/ghc -ddump-parsed test.hs
+> [1 of 2] Compiling Main             ( test.hs, test.o )
+>
+> ==================== Parser ====================
+> module Main where
+> data <Bad Type>
+> f :: Show a => a -> <Bad Type>
+> f a = case <Bad Expression> of () -> (a, a)
+> main = print $ f ()
+>
+>
+> test.hs:3:18: error: [GHC-88464]
+>     • Found type wildcard ‘_’ standing for ‘(a, a)’
+>       Where: ‘a’ is a rigid type variable bound by
+>                the inferred type of f :: Show a => a -> (a, a)
+>                at test.hs:3:1-19
+>       To use the inferred type, enable PartialTypeSignatures
+>     • In the type signature: f :: Show a => a -> <Bad Type>
+>   |
+> 3 | f :: Show a => a ->
+>   |                  ^^
+>
+> test.hs:4:12: error: [GHC-88464]
+>     ill-formed expression <Bad Expression> :: ()
+>   |
+> 4 | f a = case of () -> (a, a
+>   |            ^^
+
+```
+
+We can see the `hole` thingy, mentioned above in relation to error representation, in the
+type error reported for `f :: Show a => a -> <Bad Type>`.
+
+</details>
+
+-------------------------------------------------------------------------------
+
+<details>
+<summary> Another example </summary>
+
+```haskell
+module Lib where
+
+f (let x = 1 in y) = 10
+f (case () of _ -> 10) = 10
+
+
+
+data asd
+data -
+i = + parseErrorHere
+
+data C = data (Int, data)
+       | in Int
+
+class (data, Show a) => data a where
+  x:
+
+case
+y = let case
+arith = 1 +
+ls    = [data ..]
+tup   = (0, data, 1, data, 2, data)
+lc    = [data | in]
+
+
+localdef =
+  where {
+    ; = 2
+    ; x = 1
+    }
+
+```
+```shell
+$ _build/stage1/bin/ghc -V && _build/stage1/bin/ghc -ddump-parsed test.hs
+> The Glorious Glasgow Haskell Compilation System, version 9.13.20241026
+> [1 of 1] Compiling Lib              ( test.hs, test.o )
+>
+> ==================== Parser ====================
+> module Lib where
+> f (<Bad Pattern>) = 10
+> f (<Bad Pattern>) = 10
+> data asd
+> data <Bad Type>
+> i = <Bad Expression> + parseErrorHere
+> data C = <Bad DataCon> (Int, <Bad Type>) | <Bad DataCon> Int
+> class (<Bad Type>, Show a) => <Bad Class> a where
+>   <Bad Pattern> = <Bad Expression>
+> <Bad Pattern> = <Bad Expression>
+> y = <Bad Expression>
+> arith = 1 + <Bad Expression>
+> ls = [<Bad Expression> .. ]
+> tup
+>   = (0, <Bad Expression>, 1, <Bad Expression>, 2, <Bad Expression>)
+> lc = [<Bad Expression> | <Bad Expression>]
+> localdef
+>   = <Bad Expression>
+>   where
+>       <Bad Pattern> = 2
+>       x = 1
+>
+>
+> test.hs:13:13: error: [GHC-29916]
+>     Multiple declarations of ‘<Bad DataCon>’
+>     Declared at: test.hs:12:15
+>                  test.hs:13:13
+>    |
+> 13 |        | in Int
+>    |             ^^^
+>
+```
+</details>
+
+## Shortcomings
+The most visible one is the interaction between the `error` token, needed by the layout rule, and `catch`, take for example
+```haskell
+module Lib where
+
+x = f data
+```
+It's expected that this would be parsed as
+``` haskell
+module Lib where
+x = f <Bad Expression>
+```
+
+However, if we wrote the equivalent explicitly layouted code, we get
+```haskell
+module Lib where {
+x = f
+}
+data
+```
+
+When the parser sees `data`, which is  an unexpected token, it shifts an `error`
+token, before `catch`  (it actually tries `error` before `catch`  in every rule,
+the difference  is that  the current  state of the  parser can  handle `error`),
+which gets reduced to `vccurly`,
+
+```shell
+$ _build/stage1/bin/ghc -ddump-parsed test.hs
+> [1 of 1] Compiling Lib              ( test.hs, test.o )
+>
+> ==================== Parser ====================
+> module Lib where
+> x = f
+>
+>
+> test.hs:3:5: error: [GHC-88464] Variable not in scope: f
+>   |
+> 3 | x = f data
+>   |     ^
+```
+
+It  ignored  code  outside  the  curly  brances, due  to  the  rule  `module  ->
+module_no_catch catch`, and then reported an unbound variable error.
+
+Other  limitations include  cases where  there is  no `catch`  yet, things  like
+import declarations and  export list. This is work in  progress. We started with
+basic syntactical categories first.
+
+## Future Work
+- Our todo list has:
+  - add missing `catch` rules.
+  - add error representations for `HsCmd` node and several other nodes.
+  - eliminate all usages of `addFatalError` from PostProcess.hs.
+  - write  better error  messages for  cases when  the `hole`  related message
+    makes no sense.
+  - improve  the `exact-printing` annotations,  to print the user  input, even
+    when it  has syntax errors, along  with the error representation,  this is
+    necessary for pretty printers, where input length affects the output.
+  - Improve error reporting, the current implementation hides parsing errors, when
+    reporting errors from later stages.
+
+
+## Our work and current progress
+- Happy: (mostly work by Sebastian Graf, my contribution here was refactoring and preparing patches)
+  - <https://github.com/haskell/happy/commit/68ffbeeb607c497dfd0603c7ce2f3cff1a1895c8>
+  - <https://github.com/haskell/happy/commit/80126af45fb7fa11b9a21cdee4a93145dacfe8af>
+  - <https://github.com/haskell/happy/commit/fe480e3d905237ff7e87ac862425984c3745374c>
+  - <https://github.com/haskell/happy/commit/ed633c4ce2fa9fa0fdd6ac66e0c7a55aa1e03d59>
+  - <https://github.com/haskell/happy/commit/d9b9980767527ac7f68c88d2d764f9a4b02504b4>
+- GHC:
+  - <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/13145>