Alan Zimmerman · 925790ab
--- a/api-annotations.md
+++ b/api-annotations.md
+# This is a decription of the API Annotations introduced with GHC 7.10 RC2
+
+
+The hsSyn AST does not directly capture the locations of certain keywords and
+punctuation, such as 'let', 'in', 'do', etc.
+
+
+These locations are required by any tools wanting to parse a haskell
+file, transform the AST in some way, and then regenerate the original
+layout for the unchaged parts.
+
+
+Rather than pollute the AST with information irrelevant to the actual
+compilation process, these locations are captured in the lexer /
+parser and returned as a separate structure `ApiAnns` structure in the
+`ParsedSource`.
+
+
+This keeps both the keyword location and the original comments with their
+locations.  WIP AZ, change of venue
+
+```
+typeApiAnns=(Map.MapApiAnnKey[SrcSpan],Map.MapSrcSpan[LocatedAnnotationComment])typeApiAnnKey=(SrcSpan,AnnKeywordId)
+```
+
+
+Each AST element that needs an annotation has an entry in this Map,
+which as a key comprising the SrcSpan of the original element and the
+TyeRep of the stored annotation, if it were wrapped in a Just.
+
+
+This allows the annotation to be retrieved by
+
+```
+-- | Retrieve an annotation based on the SrcSpan of the annotated AST-- element, and the known type of the annotation.getAnnotation::ApiAnns->SrcSpan->Ann->MaybeSrcSpangetAnnotation anns span ann =Map.lookup (AK span ann) anns
+```
+
+### Annotation structures
+
+
+Each annotation is simply a `SrcSpan`.
+
+```
+-- | Note: in general the names of these are taken from the-- corresponding token, unless otherwise noteddataAnn=AnnAs|AnnAt|AnnBang|AnnBy|AnnCase|AnnClass|AnnClose-- ^ } or ] or ) or #) etc|AnnColon|AnnColon2..
+```
+
+
+Points to note:
+
+1. `AnnOpen` / `AnnClose` capture all bracketed structures.
+
+1. In the few places where the same annotation type can repeat, multiple indices are provided, hence `AnnColon` and `AnnColon2`.
+
+1. Where a value is being captured via e.g. `getINTEGER` the annotation index is called `AnnVal`.
+
+### Capturing in the parser
+
+
+The annotations are captured in the lexer / parser by extending `PState` to include a field
+
+```
+dataPState=PState{...
+        annotations ::[(ApiAnnKey,SrcSpan)]}
+```
+
+
+The lexer exposes a helper function to add an annotation
+
+```
+addAnnotation::SrcSpan->Ann->SrcSpan->P()addAnnotation l a v =P$\s ->POk s {
+  annotations =((AK l a), v): annotations s
+  }()
+```
+
+
+The parser also has some helper functions of the form
+
+```
+typeMaybeAnn=Maybe(SrcSpan->P())gl= getLoc
+gj x =Just(gl x)aa::Located a ->(Ann,Located c)->P(Located a)aa a@(L l _)(b,s)= addAnnotation l b (gl s)>> return a
+
+ams::Located a ->[MaybeAnn]->P(Located a)ams a@(L l _) bs =(mapM_ (\a -> a l)$ catMaybes bs)>> return a
+```
+
+
+This allows the annotations to be added in the parser productions as follows
+
+```
+ctypedoc::{LHsTypeRdrName}: 'forall' tv_bndrs '.' ctypedoc {% hintExplicitForall (getLoc $1)>>
+                                            ams (LL$ mkExplicitHsForAllTy $2(noLoc [])$4)[mj AnnForall$1,mj AnnDot$3]}| context '=>' ctypedoc         {% ams (LL$ mkQualifiedHsForAllTy   $1$3)[mj AnnDarrow$2]}| ipvar '::'type{% ams (LL(HsIParamTy(unLoc $1)$3))[mj AnnDcolon$2]}| typedoc                       {$1}
+```
+
+### Parse result
+
+```
+dataHsParsedModule=HsParsedModule{
+    hpm_module    ::Located(HsModuleRdrName),
+    hpm_src_files ::[FilePath],-- ^ extra source files (e.g. from #includes).  The lexer collects-- these from '# <file> <line>' pragmas, which the C preprocessor-- leaves behind.  These files and their timestamps are stored in-- the .hi file, so that we can force recompilation if any of-- them change (#3589)
+    hpm_annotations ::ApiAnns}-- | The result of successful parsing.dataParsedModule=ParsedModule{ pm_mod_summary   ::ModSummary, pm_parsed_source ::ParsedSource, pm_extra_src_files ::[FilePath], pm_annotations ::ApiAnns}
+```
+
+### Implications
+
+
+This approach has minimal implications on the rest of GHC, except that some AST elements will require to be `Located` to enable the annotation to be looked up.
+
+
+Also, initial `./validate` tests show that haddock complains of increased memory usage, due to the extra information being captured in the AST. If this becomes a major problem a flag could be introduced when invoking the parser as to whether to actually capture the annotations or not.
+
+## Notes / Shortcomings
+
+
+Currently the annotations are only guaranteed to apply to the `ParsedSource`, the renaming process flattens some of the location hooks for
+
+```
+dataHsType name
+  ...|HsRecTy[Located[ConDeclField name]]-- Only in data type declarations...-- andHsDataDefn{...
+                 dd_cons   ::[Located[LConDecl name]],-- ^ Data constructors...
+```
+
+# Possible Extension for Comments
+
+
+A possible extension to the above annotations would be to add comments as annotations, if enabled by the existing `Opt_KeepRawTokenStream` flag.
+
+
+This would add an additional structure to the `PState` for comments indexed by `SrcSpan`, and also a queue of pending comments.
+
+
+The basic mechanism would be to modify `lexToken` in `Lexer.x` to accumulate the comments in the queue as parsing progressed, and when the helper function `addComments` is called with a `SrcSpan` argument it moves any comments in the queue that fit within the `SrcSpan` into the comment annotation structure indexed by the given `SrcSpan`.
+
+
+These are then available for retrieval at the and, as with the existing annotations.
+
+## Notes on comments
+
+1. In practical terms, the usage of comments by tools requires them to be attached in some way to the logical units to which they apply. A case in point is comments preceding a function. In general, this is a hard problem, which is attacked via heuristics in `hindent` and `HaRe`.  This comment allocation would aim for a mechanistic allocation to the lowest enclosing `SrcSpan`, as a start.
+
+1. It is possible to put the lexer into `Opt_Haddock` mode, a flag which is currently unset when `Opt_KeepRawTokenStream` is enabled. If these were made inclusive, it would be possible to explicitly tag the comments as being of type haddock, so that at a future date the annotations could possibly be used directly by haddock, rather than the very complicated parsing rules currently in place.
+
+# Early design discussion
+
+## Richard Eisenberg response
+
+
+For what it's worth, my thought is not to use SrcSpanInfo (which, to me, is the wrong way to slice the abstraction) but instead to add SrcSpan fields to the relevant nodes. For example:
+
+```
+|HsDoSrcSpan-- of the word "do"BlockSrcSpans(HsStmtContextName)-- The parameterisation is unimportant-- because in this context we never use-- the PatGuard or ParStmt variant[ExprLStmt id]-- "do":one or more stmtsPostTcType-- Type of the whole expression...dataBlockSrcSpans=LayoutBlockInt-- the parameter is the indentation level...-- stuff to track the appearance of any semicolons|BracesBlock...-- stuff to track the braces and semicolons
+```
+
+
+The way I understand it, the SrcSpanInfo proposal means that we would have lots of empty SrcSpanInfos, no? Most interior nodes don't need one, I think.
+
+
+Popping up a level, I do support the idea of including this info in the AST.
+
+## SPJ response to concern re extra noise in AST
+
+>
+> I thiink the key question is whether it is acceptable to sprinkle this kind of information throughout the AST. For someone interested in source-to-source conversions (like me) this is great, others may find it intrusive.
+
+
+It’s probably not too bad if you use record syntax; thus
+
+```
+|HsDo{ hsdo_do_loc ::SrcSpan-- of the word "do", hsdo_blocks ::BlockSrcSpans, hsdo_ctxt   ::HsStmtContextName, hsdo_stmts  ::[ExprLStmt id], hsdo_type    ::PostTcType}
+```
+
+## Other issues
+
+
+The AST is initially created by the parser, and then changed through the renamer and type checker.
+
+
+From a source to source conversion perspective the `ParsedSource` is closest to the original source code, as it respects the original linear ordering of the declarations, which are each wrapped in an appropriate constructor from `HsDecl`.
+
+
+The `RenamedSource` gathers all the like declarations together, and strips out the `HsDecl`, as well as re-ordering binds to appear in dependency order.
+
+
+The `TypeCheckedSource` further changes the `RenamedSource` to replace the original type information with the calculated types.
+
+
+So manipulations need to happen at the `ParsedSource` level, but with the ability to query information from the `RenamedSource` or `TypecheckedSource` as required.
+
+
+At the moment HaRe manages this by building up a token tree indexed by SrcSpan with tokens at the leaves, constructed from the `ParsedSource`, and then indexes into it for changes based on the `RenamedSource`.  The token tree is fiddly and brittle, so it would be better to keep this information directy in the AST.
+
+## Abortive annotation parameter attempt
+
+[ D246](https://phabricator.haskell.org/D246) captures an attempt to work through a type parameter. This exploded in complexity, and was abandoned.
+
+## SPJ alternative suggestion
+
+>
+> Another way to tackle this would be to ensure that syntax tree nodes have a "node-key" (a bit like their source location) that clients could use in a finite map, to map node-key to values of their choice.
+
+
+An initial investigation shows some complexity blow up too.  The same effect can be achieved with a virtual node key.
+
+## AZ Virtual node key proposal
+
+
+Instead of physically placing a "node-key" in each AST Node, a virtual
+node key can be generated from any `GenLocated SrcSpan e` comprising a
+combination of the `SrcSpan` value and a unique identifier from the
+constructor for `e`, perhaps using its `TypeRep`, since the entire AST
+derives Typeable.
+
+
+To further reduce the intrusiveness, a base Annotation type can be
+defined that captures the location of noise tokens for each AST
+constructor. This can then be emitted from the parser, if the
+appropriate flag is set to enable it.
+
+
+So
+
+```
+dataApiAnnKey=AKSrcSpanTypeRep
+
+    mkApiAnnKey ::(Located e)->ApiAnnKey
+    mkApiAnnKey =...dataAnn=....|AnnHsLetSrcSpan-- of the word "let"SrcSpan-- of the word "in"|AnnHsDoSrcSpan-- of the word "do"
+```
+
+
+And then in the parser
+
+```
+| 'let' binds 'in' exp   { mkAnnHsLet $1$3(LL$HsLet(unLoc $2)$4)}
+```
+
+
+The helper is
+
+```
+
+    mkAnnHsLet ::Located a ->Located b ->LHsExprRdrName->P(LHsExprRdrName)
+    mkAnnHsLet (L l_let _)(L l_in _) e =do
+      addAnnotation (mkAnnKey e)(AnnHsLet l_let l_in)
+      return e;
+```
+
+
+The Parse Monad would have to accumulate the annotations to be
+returned at the end, if called with the appropriate flag.
+
+
+There will be some boilerplate in getting the annotations and helper
+functions defined, but it will not pollute the rest.
+
+
+This technique can also potentially be backported to support older GHC
+versions via a modification to ghc-parser\[1\].
+
+[ https://hackage.haskell.org/package/ghc-parser](https://hackage.haskell.org/package/ghc-parser)
+
+## Neil Mitchell Response
+
+
+I was getting a bit lost between the idea and the implementation. Let
+me try rephrasing the idea in my own words.
+
+
+The goal: Capture inner source spans in AST syntax nodes. At the
+moment if ... then ... else ... captures the spans \[if \[...\] then
+\[...\] else \[...\]\]. We want to capture the spans for each keyword as
+well, so: \[{if} \[...\] {then} \[...\] {else} \[...\]\].
+
+
+The proposal: Rather than add anything to the AST, have a separate
+mapping `(SrcSpan,AstCtor)` to `[SrcSpan]`. So you give in the `SrcSpan`
+from the `IfThenElse` node, and some token for the `IfThenElse`
+constructor, and get back a list of `IfThenElse` for the particular
+keyword.
+
+
+I like the proposal because it adds nothing inside the AST, and
+requires no fresh invariants of the AST. I dislike it because the
+contents of that separate mapping are highly tied up with the AST, and
+easy to get out of sync. I think it's the right choice for three
+reasons, 1) it is easier to try out and doesn't break the AST, so we
+have more scope for changing our minds later; 2) the same technique is
+able to represent things other than `SrcSpan` without introducing a
+polymorphic src span; 3) the people who pay the complexity are the
+people who use it, which is relatively few people.
+
+
+That said, as a tweak to the API, rather than a single data type for
+all annotations, you could have:
+
+```
+dataAnnIfThenElse=AnnIfThenElse{posIf, posThen, posElse ::SrcSpan}dataAnnDo=AnnDo{posDo ::SrcSpan}
+```
+
+
+Then you could just have an opaque `Map (SrcSpan, TypeRep) Dynamic`,
+with the invariant that the `TypeRep` in the key matches the `Dynamic`.
+Then you can have: 
+
+```
+getAnnotation::Typeable a =>Annotations->SrcSpan->Maybe a
+```
+
+
+I think it simplifies some of the TypeRep trickery
+you are engaging in with `mkAnnKey`.
+
+
+There was some further email between AZ and NDM (teaching AZ some basics) resulting in the following
+
+
+This allows code using the annotation to access this as follows
+
+```
+processHsLet::ApiAnns->LHsExpr->CustomReturnTypeprocessHsLet anns (L l (HsExpr localBinds expr))= r
+  whereJust ann = getAnnotation anns l ::MaybeAnnHsLet...
+```
+
+
+Key data structures
+
+```
+typeApiAnns=Map.MapApiAnnKeyValuedataApiAnnKey=AKSrcSpanTypeRepderiving(Eq,Ord,Show)mkApiAnnKey::(Typeable a)=>SrcSpan-> a ->ApiAnnKeymkApiAnnKey l a =AK l (typeOf (Just a))dataValue= forall a .(Eq a,Show a,Typeable a,Outputable a)=>Value a
+
+newValue::(Eq a,Show a,Typeable a,Outputable a)=> a ->ValuenewValue=ValuetypeValue::Value->TypeReptypeValue(Value x)= typeOf x
+
+fromValue::Typeable a =>Value-> a
+fromValue(Value x)= fromMaybe (error errMsg)$ res
+  where
+    res = cast x
+    errMsg ="fromValue, bad cast from "++ show (typeOf x)++" to "++ show (typeOf res)
+```
+
+
+Note that the `Value` type is based on the one in [ shake](https://github.com/ndmitchell/shake/blob/master/Development/Shake/Value.hs)
+
+
+This allows the annotation to be retrieved by
+
+```
+-- | Retrieve an annotation based on the SrcSpan of the annotated AST-- element, and the known type of the annotation.getAnnotation::(Typeable a)=>ApiAnns->SrcSpan->Maybe a
+getAnnotation anns span = res
+  where res =caseMap.lookup (AK span (typeOf res)) anns ofNothing->NothingJust d ->Just$ fromValue d
+```
+
+### Annotation structures
+
+
+Each annotation is a separate data structure, named specifically for the constructor of the AST element being retrieved.
+
+
+So if we have an AST element `L l ConstructorXXX` the corresponding annotation will be called `AnnConstructorXXX`.
+
+
+An examples
+
+```
+-- TyClDecldataAnnClassDecl=AnnClassDecl{ aclassdecl_class   ::SrcSpan, aclassdecl_mwhere  ::MaybeSrcSpan, aclassdecl_mbraces ::Maybe(SrcSpan,SrcSpan)}deriving(Eq,Data,Typeable,Show)
+```
+
+## Open Questions (AZ)
+
+
+I am currently working annotations into the parser, provided them as a separate structure at the end of the parse, indexed to the original by SrcSpan and AST element type.
+
+
+The question I have is how to capture commas and semicolons in lists of items.
+
+
+There are at least three ways of doing this
+
+1. Make sure each of the items is Located, and add the possible comma location to the annotation structure for it.
+
+
+This has the drawback that all instances of the AST item annotation have the possible comma location in them, and it does not cope with multiple separators where these are allowed.
+
+1. Introduce a new hsSyn structure to explicitly capture comma-separated lists.
+
+
+This is the current approach I am taking, modelled on the OrdList implementation, but with an extra constructor to capture the separator location.
+
+
+Thus
+
+```
+dataHsCommaList a
+  =Empty|Cons a (HsCommaList a)|ExtraCommaSrcSpan(HsCommaList a)-- ^ We need a SrcSpan for the annotation|Snoc(HsCommaList a) a
+  |Two(HsCommaList a)-- Invariant: non-empty(HsCommaList a)-- Invariant: non-empty
+```
+
+1. Change the lists to be of type `[Either SrcSpan a]` to explicitly capture the comma locations in the list.
+
+1. A fourth way is to add a list of SrcSpan to the annotation for the parent structure of the list, simply tracking the comma positions. This will make working with the annotations complicated though.
+
+
+I am currently proceeding with option 2, but would appreciate some comment on whether this is the best approach to take. 
+
+
+Option 2 will allow the AST to capture the extra commas in record constructors, as suggested by SPJ in the debate on that feature.
+
+
+However, the structure is being misused in that `ExtraComma` is used to capture ALL commas, as well as semicolons in the `{ .. ; .. }` idiom.
+
+## Update 2014-10-12
+
+
+Based on further feedback from Neil Mitchell and SPJ, the basic annotation is now
+
+```
+typeApiAnns=Map.MapApiAnnKeySrcSpandataApiAnnKey=AKSrcSpanAnnderiving(Eq,Ord,Show)-- ----------------------------------------------------------------------- | Retrieve an annotation based on the SrcSpan of the annotated AST-- element, and the known type of the annotation.getAnnotation::ApiAnns->SrcSpan->Ann->MaybeSrcSpangetAnnotation anns span ann =Map.lookup (AK span ann) anns
+
+-- ---------------------------------------------------------------------- | Note: in general the names of these are taken from the-- corresponding token, unless otherwise noteddataAnn=AnnAs|AnnBang|AnnClass|AnnClose-- ^ } or ] or ) or #) etc|AnnComma|AnnDarrow|AnnData|AnnDcolon....
+```
+
+
+This is a lot simpler than before.