The completed feature is described at ApiAnnotations. This page now captures the discussion leading up to the implementation.
#9628 (closed)
This is a proposal / discussion page for adding annotations to the AST, for ticketRight now the locations of syntactic markers such as do
/let
/where
/in
/of
in the source are discarded from the AST, although they are retained in the rich token stream.
The haskell-src-exts
package deals with this by means of using the `SrcSpanInfo` data type which contains the SrcSpan as per the current GHC Located type but also has a list of SrcSpan
s for the syntactic markers, depending on the particular AST fragment being annotated.
The motivation for this change is then
-
Simplify the roundtripping and modification of source by explicitly capturing the missing location information for the syntactic markers.
-
Allow simple round-tripping of code via a GHC variant of hindent
-
Allow tools such as a future hlint and HaRe to make changes to the AST and then return the source unchanged except for the intended changes.
Current design
Current implementation
Note: This implementation has been split apart
The first part covers the AST changes only D426
The second covers the addition of annotations D438.
Also, D412 adds the original source text to literals that are converted during parsing.
The original proof of concept was D297
Theory of operation.
The HsSyn AST does not capture the locations of certain keywords and punctuation, such as 'let', 'in', 'do', etc.
These locations are required by any tools wanting to parse a haskell file, transform the AST in some way, and then regenerate the original layout for the unchaged parts.
Rather than pollute the AST with information irrelevant to the actual compilation process, these locations are captured in the lexer / parser and returned as a separate structure ApiAnns structure in the ParsedSource.
Each AST element that needs an annotation has an entry in this Map, which as a key comprising the SrcSpan of the original element and the TyeRep of the stored annotation, if it were wrapped in a Just.
This allows code using the annotation to access this as follows
processHsLet :: ApiAnns -> LHsExpr -> CustomReturnType
processHsLet anns (L l (HsExpr localBinds expr)) = r
where
Just letPos = getAnnotation anns l AnnLet
Just inPos = getAnnotation anns l AnnIn
...
Key data structures
type ApiAnns = Map.Map ApiAnnKey SrcSpan
data ApiAnnKey = AK SrcSpan Ann
deriving (Eq,Ord,Show)
This allows the annotation to be retrieved by
-- | Retrieve an annotation based on the SrcSpan of the annotated AST
-- element, and the known type of the annotation.
getAnnotation :: ApiAnns -> SrcSpan -> Ann -> Maybe SrcSpan
getAnnotation anns span ann = Map.lookup (AK span ann) anns
Annotation structures
Each annotation is simply a SrcSpan
.
-- | Note: in general the names of these are taken from the
-- corresponding token, unless otherwise noted
data Ann = AnnAs
| AnnAt
| AnnBang
| AnnBy
| AnnCase
| AnnClass
| AnnClose -- ^ } or ] or ) or #) etc
| AnnColon
| AnnColon2
..
Points to note:
-
AnnOpen
/AnnClose
capture all bracketed structures. -
In the few places where the same annotation type can repeat, multiple indices are provided, hence
AnnColon
andAnnColon2
. -
Where a value is being captured via e.g.
getINTEGER
the annotation index is calledAnnVal
.
Capturing in the parser
The annotations are captured in the lexer / parser by extending PState
to include a field
data PState = PState {
...
annotations :: [(ApiAnnKey,SrcSpan)]
}
The lexer exposes a helper function to add an annotation
addAnnotation :: SrcSpan -> Ann -> SrcSpan -> P ()
addAnnotation l a v = P $ \s -> POk s {
annotations = ((AK l a), v) : annotations s
} ()
The parser also has some helper functions of the form
type MaybeAnn = Maybe (SrcSpan -> P ())
gl = getLoc
gj x = Just (gl x)
aa :: Located a -> (Ann,Located c) -> P (Located a)
aa a@(L l _) (b,s) = addAnnotation l b (gl s) >> return a
ams :: Located a -> [MaybeAnn] -> P (Located a)
ams a@(L l _) bs = (mapM_ (\a -> a l) $ catMaybes bs) >> return a
This allows the annotations to be added in the parser productions as follows
ctypedoc :: { LHsType RdrName }
: 'forall' tv_bndrs '.' ctypedoc {% hintExplicitForall (getLoc $1) >>
ams (LL $ mkExplicitHsForAllTy $2 (noLoc []) $4)
[mj AnnForall $1,mj AnnDot $3] }
| context '=>' ctypedoc {% ams (LL $ mkQualifiedHsForAllTy $1 $3)
[mj AnnDarrow $2] }
| ipvar '::' type {% ams (LL (HsIParamTy (unLoc $1) $3))
[mj AnnDcolon $2] }
| typedoc { $1 }
Parse result
data HsParsedModule = HsParsedModule {
hpm_module :: Located (HsModule RdrName),
hpm_src_files :: [FilePath],
-- ^ extra source files (e.g. from #includes). The lexer collects
-- these from '# <file> <line>' pragmas, which the C preprocessor
-- leaves behind. These files and their timestamps are stored in
-- the .hi file, so that we can force recompilation if any of
-- them change (#3589)
hpm_annotations :: ApiAnns
}
-- | The result of successful parsing.
data ParsedModule =
ParsedModule { pm_mod_summary :: ModSummary
, pm_parsed_source :: ParsedSource
, pm_extra_src_files :: [FilePath]
, pm_annotations :: ApiAnns }
Implications
This approach has minimal implications on the rest of GHC, except that some AST elements will require to be Located
to enable the annotation to be looked up.
Also, initial ./validate
tests show that haddock complains of increased memory usage, due to the extra information being captured in the AST. If this becomes a major problem a flag could be introduced when invoking the parser as to whether to actually capture the annotations or not.
Notes / Shortcomings
Currently the annotations are only guaranteed to apply to the ParsedSource
, the renaming process flattens some of the location hooks for
data HsType name
...
| HsRecTy [Located [ConDeclField name]] -- Only in data type declarations
...
-- and
HsDataDefn {
...
dd_cons :: [Located [LConDecl name]],
-- ^ Data constructors
...
Possible Extension for Comments
A possible extension to the above annotations would be to add comments as annotations, if enabled by the existing Opt_KeepRawTokenStream
flag.
This would add an additional structure to the PState
for comments indexed by SrcSpan
, and also a queue of pending comments.
The basic mechanism would be to modify lexToken
in Lexer.x
to accumulate the comments in the queue as parsing progressed, and when the helper function addComments
is called with a SrcSpan
argument it moves any comments in the queue that fit within the SrcSpan
into the comment annotation structure indexed by the given SrcSpan
.
These are then available for retrieval at the and, as with the existing annotations.
Notes on comments
-
In practical terms, the usage of comments by tools requires them to be attached in some way to the logical units to which they apply. A case in point is comments preceding a function. In general, this is a hard problem, which is attacked via heuristics in
hindent
andHaRe
. This comment allocation would aim for a mechanistic allocation to the lowest enclosingSrcSpan
, as a start. -
It is possible to put the lexer into
Opt_Haddock
mode, a flag which is currently unset whenOpt_KeepRawTokenStream
is enabled. If these were made inclusive, it would be possible to explicitly tag the comments as being of type haddock, so that at a future date the annotations could possibly be used directly by haddock, rather than the very complicated parsing rules currently in place.
Early design discussion
Richard Eisenberg response
For what it's worth, my thought is not to use SrcSpanInfo (which, to me, is the wrong way to slice the abstraction) but instead to add SrcSpan fields to the relevant nodes. For example:
| HsDo SrcSpan -- of the word "do"
BlockSrcSpans
(HsStmtContext Name) -- The parameterisation is unimportant
-- because in this context we never use
-- the PatGuard or ParStmt variant
[ExprLStmt id] -- "do":one or more stmts
PostTcType -- Type of the whole expression
...
data BlockSrcSpans = LayoutBlock Int -- the parameter is the indentation level
... -- stuff to track the appearance of any semicolons
| BracesBlock ... -- stuff to track the braces and semicolons
The way I understand it, the SrcSpanInfo proposal means that we would have lots of empty SrcSpanInfos, no? Most interior nodes don't need one, I think.
Popping up a level, I do support the idea of including this info in the AST.
SPJ response to concern re extra noise in AST
I thiink the key question is whether it is acceptable to sprinkle this kind of information throughout the AST. For someone interested in source-to-source conversions (like me) this is great, others may find it intrusive.
It’s probably not too bad if you use record syntax; thus
| HsDo { hsdo_do_loc :: SrcSpan -- of the word "do"
, hsdo_blocks :: BlockSrcSpans
, hsdo_ctxt :: HsStmtContext Name
, hsdo_stmts :: [ExprLStmt id]
, hsdo_type :: PostTcType }
Other issues
The AST is initially created by the parser, and then changed through the renamer and type checker.
From a source to source conversion perspective the ParsedSource
is closest to the original source code, as it respects the original linear ordering of the declarations, which are each wrapped in an appropriate constructor from HsDecl
.
The RenamedSource
gathers all the like declarations together, and strips out the HsDecl
, as well as re-ordering binds to appear in dependency order.
The TypeCheckedSource
further changes the RenamedSource
to replace the original type information with the calculated types.
So manipulations need to happen at the ParsedSource
level, but with the ability to query information from the RenamedSource
or TypecheckedSource
as required.
At the moment HaRe manages this by building up a token tree indexed by SrcSpan with tokens at the leaves, constructed from the ParsedSource
, and then indexes into it for changes based on the RenamedSource
. The token tree is fiddly and brittle, so it would be better to keep this information directy in the AST.
Abortive annotation parameter attempt
D246 captures an attempt to work through a type parameter. This exploded in complexity, and was abandoned.
SPJ alternative suggestion
Another way to tackle this would be to ensure that syntax tree nodes have a "node-key" (a bit like their source location) that clients could use in a finite map, to map node-key to values of their choice.
An initial investigation shows some complexity blow up too. The same effect can be achieved with a virtual node key.
AZ Virtual node key proposal
Instead of physically placing a "node-key" in each AST Node, a virtual
node key can be generated from any GenLocated SrcSpan e
comprising a
combination of the SrcSpan
value and a unique identifier from the
constructor for e
, perhaps using its TypeRep
, since the entire AST
derives Typeable.
To further reduce the intrusiveness, a base Annotation type can be defined that captures the location of noise tokens for each AST constructor. This can then be emitted from the parser, if the appropriate flag is set to enable it.
So
data ApiAnnKey = AK SrcSpan TypeRep
mkApiAnnKey :: (Located e) -> ApiAnnKey
mkApiAnnKey = ...
data Ann =
....
| AnnHsLet SrcSpan -- of the word "let"
SrcSpan -- of the word "in"
| AnnHsDo SrcSpan -- of the word "do"
And then in the parser
| 'let' binds 'in' exp { mkAnnHsLet $1 $3 (LL $ HsLet (unLoc $2) $4) }
The helper is
mkAnnHsLet :: Located a -> Located b -> LHsExpr RdrName -> P (LHsExpr RdrName)
mkAnnHsLet (L l_let _) (L l_in _) e = do
addAnnotation (mkAnnKey e) (AnnHsLet l_let l_in)
return e;
The Parse Monad would have to accumulate the annotations to be returned at the end, if called with the appropriate flag.
There will be some boilerplate in getting the annotations and helper functions defined, but it will not pollute the rest.
This technique can also potentially be backported to support older GHC versions via a modification to ghc-parser[1].
https://hackage.haskell.org/package/ghc-parser
Neil Mitchell Response
I was getting a bit lost between the idea and the implementation. Let me try rephrasing the idea in my own words.
The goal: Capture inner source spans in AST syntax nodes. At the moment if ... then ... else ... captures the spans [if [...] then [...] else [...]]. We want to capture the spans for each keyword as well, so: [{if} [...] {then} [...] {else} [...]].
The proposal: Rather than add anything to the AST, have a separate
mapping (SrcSpan,AstCtor)
to [SrcSpan]
. So you give in the SrcSpan
from the IfThenElse
node, and some token for the IfThenElse
constructor, and get back a list of IfThenElse
for the particular
keyword.
I like the proposal because it adds nothing inside the AST, and
requires no fresh invariants of the AST. I dislike it because the
contents of that separate mapping are highly tied up with the AST, and
easy to get out of sync. I think it's the right choice for three
reasons, 1) it is easier to try out and doesn't break the AST, so we
have more scope for changing our minds later; 2) the same technique is
able to represent things other than SrcSpan
without introducing a
polymorphic src span; 3) the people who pay the complexity are the
people who use it, which is relatively few people.
That said, as a tweak to the API, rather than a single data type for all annotations, you could have:
data AnnIfThenElse = AnnIfThenElse {posIf, posThen, posElse :: SrcSpan}
data AnnDo = AnnDo {posDo :: SrcSpan}
Then you could just have an opaque Map (SrcSpan, TypeRep) Dynamic
,
with the invariant that the TypeRep
in the key matches the Dynamic
.
Then you can have:
getAnnotation :: Typeable a => Annotations -> SrcSpan -> Maybe a
I think it simplifies some of the TypeRep trickery
you are engaging in with mkAnnKey
.
There was some further email between AZ and NDM (teaching AZ some basics) resulting in the following
This allows code using the annotation to access this as follows
processHsLet :: ApiAnns -> LHsExpr -> CustomReturnType
processHsLet anns (L l (HsExpr localBinds expr)) = r
where
Just ann = getAnnotation anns l :: Maybe AnnHsLet
...
Key data structures
type ApiAnns = Map.Map ApiAnnKey Value
data ApiAnnKey = AK SrcSpan TypeRep
deriving (Eq,Ord,Show)
mkApiAnnKey :: (Typeable a) => SrcSpan -> a -> ApiAnnKey
mkApiAnnKey l a = AK l (typeOf (Just a))
data Value = forall a . (Eq a, Show a, Typeable a, Outputable a) => Value a
newValue :: (Eq a, Show a, Typeable a, Outputable a) => a -> Value
newValue = Value
typeValue :: Value -> TypeRep
typeValue (Value x) = typeOf x
fromValue :: Typeable a => Value -> a
fromValue (Value x) = fromMaybe (error errMsg) $ res
where
res = cast x
errMsg = "fromValue, bad cast from " ++ show (typeOf x)
++ " to " ++ show (typeOf res)
Note that the Value
type is based on the one in shake
This allows the annotation to be retrieved by
-- | Retrieve an annotation based on the SrcSpan of the annotated AST
-- element, and the known type of the annotation.
getAnnotation :: (Typeable a) => ApiAnns -> SrcSpan -> Maybe a
getAnnotation anns span = res
where res = case Map.lookup (AK span (typeOf res)) anns of
Nothing -> Nothing
Just d -> Just $ fromValue d
Annotation structures
Each annotation is a separate data structure, named specifically for the constructor of the AST element being retrieved.
So if we have an AST element L l ConstructorXXX
the corresponding annotation will be called AnnConstructorXXX
.
An examples
-- TyClDecl
data AnnClassDecl = AnnClassDecl
{ aclassdecl_class :: SrcSpan
, aclassdecl_mwhere :: Maybe SrcSpan
, aclassdecl_mbraces :: Maybe (SrcSpan,SrcSpan) }
deriving (Eq,Data,Typeable,Show)
Open Questions (AZ)
I am currently working annotations into the parser, provided them as a separate structure at the end of the parse, indexed to the original by SrcSpan and AST element type.
The question I have is how to capture commas and semicolons in lists of items.
There are at least three ways of doing this
- Make sure each of the items is Located, and add the possible comma location to the annotation structure for it.
This has the drawback that all instances of the AST item annotation have the possible comma location in them, and it does not cope with multiple separators where these are allowed.
- Introduce a new hsSyn structure to explicitly capture comma-separated lists.
This is the current approach I am taking, modelled on the OrdList implementation, but with an extra constructor to capture the separator location.
Thus
data HsCommaList a
= Empty
| Cons a (HsCommaList a)
| ExtraComma SrcSpan (HsCommaList a)
-- ^ We need a SrcSpan for the annotation
| Snoc (HsCommaList a) a
| Two (HsCommaList a) -- Invariant: non-empty
(HsCommaList a) -- Invariant: non-empty
-
Change the lists to be of type
[Either SrcSpan a]
to explicitly capture the comma locations in the list. -
A fourth way is to add a list of SrcSpan to the annotation for the parent structure of the list, simply tracking the comma positions. This will make working with the annotations complicated though.
I am currently proceeding with option 2, but would appreciate some comment on whether this is the best approach to take.
Option 2 will allow the AST to capture the extra commas in record constructors, as suggested by SPJ in the debate on that feature.
However, the structure is being misused in that ExtraComma
is used to capture ALL commas, as well as semicolons in the { .. ; .. }
idiom.
Update 2014-10-12
Based on further feedback from Neil Mitchell and SPJ, the basic annotation is now
type ApiAnns = Map.Map ApiAnnKey SrcSpan
data ApiAnnKey = AK SrcSpan Ann
deriving (Eq,Ord,Show)
-- ---------------------------------------------------------------------
-- | Retrieve an annotation based on the SrcSpan of the annotated AST
-- element, and the known type of the annotation.
getAnnotation :: ApiAnns -> SrcSpan -> Ann -> Maybe SrcSpan
getAnnotation anns span ann = Map.lookup (AK span ann) anns
-- --------------------------------------------------------------------
-- | Note: in general the names of these are taken from the
-- corresponding token, unless otherwise noted
data Ann = AnnAs
| AnnBang
| AnnClass
| AnnClose -- ^ } or ] or ) or #) etc
| AnnComma
| AnnDarrow
| AnnData
| AnnDcolon
....
This is a lot simpler than before.