Lexer: lex_string_tok is horrible
In a recent mail, @rae made me aware of how GHC lexes strings:
-- Strings and chars are lexed by hand-written code. The reason is
-- that even if we recognise the string or char here in the regex
-- lexer, we would still have to parse the string afterward in order
-- to convert it to a String.
<0> {
\' { lex_char_tok }
\" { lex_string_tok }
}
....
------------------------------------------
-- Strings & Chars
-- This stuff is horrible. I hates it. --- <--- Simon M., 2003
lex_string_tok :: Action
lex_string_tok span buf _len = do
tok <- lex_string ""
(AI end bufEnd) <- getInput
let
tok' = case tok of
ITprimstring _ bs -> ITprimstring (SourceText src) bs
ITstring _ s -> ITstring (SourceText src) s
_ -> panic "lex_string_tok"
src = lexemeToString buf (cur bufEnd - cur buf)
return (L (mkPsSpan (psSpanStart span) end) tok')
lex_string :: String -> P Token
lex_string s = do
i <- getInput
case alexGetChar' i of
Nothing -> lit_error i
Just ('"',i) -> do
setInput i
let s' = reverse s
magicHash <- getBit MagicHashBit
if magicHash
then do
i <- getInput
case alexGetChar' i of
Just ('#',i) -> do
setInput i
when (any (> '\xFF') s') $ do
pState <- getPState
let err = PsError PsErrPrimStringInvalidChar [] (mkSrcSpanPs (last_loc pState))
addError err
return (ITprimstring (SourceText s') (unsafeMkByteString s'))
_other ->
return (ITstring (SourceText s') (mkFastString s'))
else
return (ITstring (SourceText s') (mkFastString s'))
Just ('\\',i)
| Just ('&',i) <- next -> do
setInput i; lex_string s
| Just (c,i) <- next, c <= '\x7f' && is_space c -> do
-- is_space only works for <= '\x7f' (#3751, #5425)
setInput i; lex_stringgap s
where next = alexGetChar' i
Just (c, i1) -> do
case c of
'\\' -> do setInput i1; c' <- lex_escape; lex_string (c':s)
c | isAny c -> do setInput i1; lex_string (c:s)
_other -> lit_error i
lex_stringgap :: String -> P Token
lex_stringgap s = do
i <- getInput
c <- getCharOrFail i
case c of
'\\' -> lex_string s
c | c <= '\x7f' && is_space c -> lex_stringgap s
-- is_space only works for <= '\x7f' (#3751, #5425)
_other -> lit_error i
Horrible indeed it is: the string lexer is completely hand-written! Obviously, writing down the FSM transitions by hand is very error prone and completely non-extensible.
Why was it not written within Alex' declarative language in the first place? I don't know for certain, but the comment that starts the above snippet suggests it's because of the need to escape the lexed strings, indeed a valid concern.
Nevertheless, I see two options here:
- Escape literal strings in the parser or (better) in the
AlexToken
case oflexToken
. Needs to retain the whole unescaped string and might lead to escaping the same string repeatedly. - Add lexer logic that escapes the string. Even the alex user guide gives an example for string parsing using "start codes" (3.2.2.2 here, which seems to be inspired by flex's "start conditions"), e.g. you can specify and "call" a sub-lexer for lexing the content of a string literal. GHC's lexer already seems to use start codes for layout, so the awareness of the feature apparently wasn't enough to use it for lexing strings. Indeed, it doesn't solve the escaping problem at all! On the other hand, supporting string escapes seems to be an explicit goal for flex's start condition as per the user guide. So maybe it's as simple as modifying the lookahead token when we see a
\
in a<string>
start code?
Some feedback from @simonmar would be helpful, who might recall the exact pitfalls involved with (2).