Commit 64c9898f authored by thomie, committed by Austin Seipp

Make Lexer.x more like the 2010 report

Summary:
I tried reading the lexer and the 2010 report side-by-side. Although I didn't
quite finish, here are some small discrepancies that I found.

This revision may be low priority for reviewers, but having these commits just
in my local repository does no good either.

Changes:
* $nl was defined, but not used anywhere
* formfeed is a newline character
* add \: to $ascsymbol
  To simplify the grammar, the 2010 report added the colon (':') to the
  character set $ascsymbol. Here we make the same change.
* introduce the macros `qvarid`, `qconid`, `qvarsym` and `qconsym`
* foreign is a Haskell keyword
* add/update comments
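
A minimal Haskell sketch of the new symbol classification, to illustrate the
`$ascsymbol` change together with the reworked `@varsym` and `@consym` macros.
The names `ascSymbol`, `isSymbolChar`, `isVarSym` and `isConSym` are
hypothetical and not part of Lexer.x. The sketch covers ASCII only and ignores
reserved operators and dash-only sequences, which the real lexer handles
elsewhere.

```haskell
-- Hypothetical stand-in for $ascsymbol after this change: the colon is
-- now a member of the set, as in the Haskell 2010 report.
ascSymbol :: String
ascSymbol = "!#$%&*+./<=>?@\\^|-~:"

-- ASCII-only approximation of $symbol (Unicode symbols are ignored here).
isSymbolChar :: Char -> Bool
isSymbolChar c = c `elem` ascSymbol

-- Mirrors  @varsym = ($symbol # \:) $symbol*
-- The first character is any symbol except ':', the rest are any symbols.
isVarSym :: String -> Bool
isVarSym (c:cs) = isSymbolChar c && c /= ':' && all isSymbolChar cs
isVarSym []     = False

-- Mirrors  @consym = \: $symbol*
-- The first character is ':', the rest are any symbols.
isConSym :: String -> Bool
isConSym (':':cs) = all isSymbolChar cs
isConSym _        = False
```

With the colon inside the symbol set, a separate `$symchar` set is no longer
needed: only the first character decides between a variable symbol and a
constructor symbol.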

Test Plan: Harbormaster (is awesome)

Reviewers: simonmar, hvr, austin

Reviewed By: austin

Subscribers: hvr, simonmar, ezyang, carter

Differential Revision: https://phabricator.haskell.org/D180
parent 0f31c2e5
 -----------------------------------------------------------------------------
 -- (c) The University of Glasgow, 2006
 --
--- GHC's lexer.
+-- GHC's lexer for Haskell 2010 [1].
 --
--- This is a combination of an Alex-generated lexer from a regex
--- definition, with some hand-coded bits.
+-- This is a combination of an Alex-generated lexer [2] from a regex
+-- definition, with some hand-coded bits. [3]
 --
 -- Completely accurate information about token-spans within the source
 -- file is maintained. Every token has a start and end RealSrcLoc
 -- attached to it.
 --
+-- References:
+-- [1] https://www.haskell.org/onlinereport/haskell2010/haskellch2.html
+-- [2] http://www.haskell.org/alex/
+-- [3] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Parser
+--
 -----------------------------------------------------------------------------
 -- ToDo / known bugs:
@@ -31,6 +36,10 @@
 -- form? This is quite difficult to achieve. We don't do it for
 -- qualified varids.
+-- -----------------------------------------------------------------------------
+-- Alex "Haskell code fragment top"
 {
 -- XXX The above flags turn off warnings in the generated code:
 {-# LANGUAGE BangPatterns #-}
@@ -91,48 +100,55 @@ import Data.Ratio
 import Data.Word
 }
-$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
-$whitechar = [\ \n\r\f\v $unispace]
-$white_no_nl = $whitechar # \n
+-- -----------------------------------------------------------------------------
+-- Alex "Character set macros"
+$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetByte.
+$nl = [\n\r\f]
+$whitechar = [$nl\v\ $unispace]
+$white_no_nl = $whitechar # \n -- TODO #8424
 $tab = \t
 $ascdigit = 0-9
-$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
+$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetByte.
 $decdigit = $ascdigit -- for now, should really be $digit (ToDo)
 $digit = [$ascdigit $unidigit]
 $special = [\(\)\,\;\[\]\`\{\}]
-$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
-$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
-$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
-$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
+$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~\:]
+$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetByte.
+$symbol = [$ascsymbol $unisymbol] # [$special \_\"\']
+$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetByte.
 $asclarge = [A-Z]
 $large = [$asclarge $unilarge]
-$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
+$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetByte.
 $ascsmall = [a-z]
 $small = [$ascsmall $unismall \_]
-$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
-$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
+$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetByte.
+$graphic = [$small $large $symbol $digit $special $unigraphic \"\']
 $binit = 0-1
 $octit = 0-7
 $hexit = [$decdigit A-F a-f]
-$symchar = [$symbol \:]
-$nl = [\n\r]
 $idchar = [$small $large $digit \']
 $pragmachar = [$small $large $digit]
 $docsym = [\| \^ \* \$]
-@varid = $small $idchar*
-@conid = $large $idchar*
-@varsym = $symbol $symchar*
-@consym = \: $symchar*
+-- -----------------------------------------------------------------------------
+-- Alex "Regular expression macros"
+@varid = $small $idchar* -- variable identifiers
+@conid = $large $idchar* -- constructor identifiers
+@varsym = ($symbol # \:) $symbol* -- variable (operator) symbol
+@consym = \: $symbol* -- constructor (operator) symbol
 @decimal = $decdigit+
 @binary = $binit+
@@ -140,8 +156,11 @@ $docsym = [\| \^ \* \$]
 @hexadecimal = $hexit+
 @exponent = [eE] [\-\+]? @decimal
 -- we support the hierarchical module name extension:
 @qual = (@conid \.)+
+@qvarid = @qual @varid
+@qconid = @qual @conid
+@qvarsym = @qual @varsym
+@qconsym = @qual @consym
 @floating_point = @decimal \. @decimal @exponent? | @decimal @exponent
@@ -150,9 +169,17 @@ $docsym = [\| \^ \* \$]
 @negative = \-
 @signed = @negative ?
+-- -----------------------------------------------------------------------------
+-- Alex "Identifier"
 haskell :-
--- everywhere: skip whitespace and comments
+-- -----------------------------------------------------------------------------
+-- Alex "Rules"
+-- everywhere: skip whitespace
 $white_no_nl+ ;
 $tab+ { warn Opt_WarnTabs (text "Tab character") }
@@ -179,7 +206,7 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 -- have a Haddock comment). The rules then munch the rest of the line.
 "-- " ~[$docsym \#] .* { lineCommentToken }
-"--" [^$symbol : \ ] .* { lineCommentToken }
+"--" [^$symbol \ ] .* { lineCommentToken }
 -- Next, match Haddock comments if no -haddock flag
@@ -191,7 +218,7 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 -- make sure that the first non-dash character isn't a symbol, and munch the
 -- rest of the line.
-"---"\-* [^$symbol :] .* { lineCommentToken }
+"---"\-* ~$symbol .* { lineCommentToken }
 -- Since the previous rules all match dashes followed by at least one
 -- character, we also need to match a whole line filled with just dashes.
@@ -252,13 +279,13 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 -- single-line line pragmas, of the form
 -- # <line> "<file>" <extra-stuff> \n
-<line_prag1> $decdigit+ { setLine line_prag1a }
+<line_prag1> @decimal { setLine line_prag1a }
 <line_prag1a> \" [$graphic \ ]* \" { setFile line_prag1b }
 <line_prag1b> .* { pop }
 -- Haskell-style line pragmas, of the form
 -- {-# LINE <line> "<file>" #-}
-<line_prag2> $decdigit+ { setLine line_prag2a }
+<line_prag2> @decimal { setLine line_prag2a }
 <line_prag2a> \" [$graphic \ ]* \" { setFile line_prag2b }
 <line_prag2b> "#-}"|"-}" { pop }
 -- NOTE: accept -} at the end of a LINE pragma, for compatibility
@@ -341,8 +368,8 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 { lex_quasiquote_tok }
 -- qualified quasi-quote (#5555)
-"[" @qual @varid "|" / { ifExtension qqEnabled }
+"[" @qvarid "|" / { ifExtension qqEnabled }
 { lex_qquasiquote_tok }
 }
 <0> {
@@ -376,15 +403,15 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 }
 <0,option_prags> {
-@qual @varid { idtoken qvarid }
-@qual @conid { idtoken qconid }
+@qvarid { idtoken qvarid }
+@qconid { idtoken qconid }
 @varid { varid }
 @conid { idtoken conid }
 }
 <0> {
-@qual @varid "#"+ / { ifExtension magicHashEnabled } { idtoken qvarid }
-@qual @conid "#"+ / { ifExtension magicHashEnabled } { idtoken qconid }
+@qvarid "#"+ / { ifExtension magicHashEnabled } { idtoken qvarid }
+@qconid "#"+ / { ifExtension magicHashEnabled } { idtoken qconid }
 @varid "#"+ / { ifExtension magicHashEnabled } { varid }
 @conid "#"+ / { ifExtension magicHashEnabled } { idtoken conid }
 }
@@ -392,8 +419,8 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 -- ToDo: - move `var` and (sym) into lexical syntax?
 -- - remove backquote from $special?
 <0> {
-@qual @varsym { idtoken qvarsym }
-@qual @consym { idtoken qconsym }
+@qvarsym { idtoken qvarsym }
+@qconsym { idtoken qconsym }
 @varsym { varsym }
 @consym { consym }
 }
@@ -453,6 +480,10 @@ $tab+ { warn Opt_WarnTabs (text "Tab character") }
 \" { lex_string_tok }
 }
+-- -----------------------------------------------------------------------------
+-- Alex "Haskell code fragment bottom"
 {
 -- -----------------------------------------------------------------------------
 -- The token type
@@ -467,6 +498,7 @@ data Token
 | ITdo
 | ITelse
 | IThiding
+| ITforeign
 | ITif
 | ITimport
 | ITin
@@ -484,7 +516,6 @@ data Token
 | ITwhere
 | ITforall -- GHC extension keywords
-| ITforeign
 | ITexport
 | ITlabel
 | ITdynamic
@@ -1738,13 +1769,13 @@ alexGetByte (AI loc s)
 loc' = advanceSrcLoc loc c
 byte = fromIntegral $ ord adj_c
-non_graphic = '\x0'
-upper = '\x1'
-lower = '\x2'
-digit = '\x3'
-symbol = '\x4'
-space = '\x5'
-other_graphic = '\x6'
+non_graphic = '\x00'
+upper = '\x01'
+lower = '\x02'
+digit = '\x03'
+symbol = '\x04'
+space = '\x05'
+other_graphic = '\x06'
 adj_c
 | c <= '\x06' = non_graphic
......