Lexer: lex_string_tok is horrible

In a recent mail, @rae made me aware of how GHC lexes strings:

-- Strings and chars are lexed by hand-written code.  The reason is
-- that even if we recognise the string or char here in the regex
-- lexer, we would still have to parse the string afterward in order
-- to convert it to a String.
<0> {
  \'                            { lex_char_tok }
  \"                            { lex_string_tok }
}

....

------------------------------------------
-- Strings & Chars

-- This stuff is horrible.  I hates it.    --- <--- Simon M., 2003

lex_string_tok :: Action
lex_string_tok span buf _len = do
  tok <- lex_string ""
  (AI end bufEnd) <- getInput
  let
    tok' = case tok of
            ITprimstring _ bs -> ITprimstring (SourceText src) bs
            ITstring _ s -> ITstring (SourceText src) s
            _ -> panic "lex_string_tok"
    src = lexemeToString buf (cur bufEnd - cur buf)
  return (L (mkPsSpan (psSpanStart span) end) tok')

lex_string :: String -> P Token
lex_string s = do
  i <- getInput
  case alexGetChar' i of
    Nothing -> lit_error i

    Just ('"',i)  -> do
        setInput i
        let s' = reverse s
        magicHash <- getBit MagicHashBit
        if magicHash
          then do
            i <- getInput
            case alexGetChar' i of
              Just ('#',i) -> do
                setInput i
                when (any (> '\xFF') s') $ do
                  pState <- getPState
                  let err = PsError PsErrPrimStringInvalidChar [] (mkSrcSpanPs (last_loc pState))
                  addError err
                return (ITprimstring (SourceText s') (unsafeMkByteString s'))
              _other ->
                return (ITstring (SourceText s') (mkFastString s'))
          else
                return (ITstring (SourceText s') (mkFastString s'))

    Just ('\\',i)
        | Just ('&',i) <- next -> do
                setInput i; lex_string s
        | Just (c,i) <- next, c <= '\x7f' && is_space c -> do
                           -- is_space only works for <= '\x7f' (#3751, #5425)
                setInput i; lex_stringgap s
        where next = alexGetChar' i

    Just (c, i1) -> do
        case c of
          '\\' -> do setInput i1; c' <- lex_escape; lex_string (c':s)
          c | isAny c -> do setInput i1; lex_string (c:s)
          _other -> lit_error i

lex_stringgap :: String -> P Token
lex_stringgap s = do
  i <- getInput
  c <- getCharOrFail i
  case c of
    '\\' -> lex_string s
    c | c <= '\x7f' && is_space c -> lex_stringgap s
                           -- is_space only works for <= '\x7f' (#3751, #5425)
    _other -> lit_error i

Horrible indeed it is: the string lexer is completely hand-written! Obviously, writing down the FSM transitions by hand is very error prone and completely non-extensible.

Why was it not written within Alex' declarative language in the first place? I don't know for certain, but the comment that starts the above snippet suggests it's because of the need to escape the lexed strings, indeed a valid concern.

Nevertheless, I see two options here:

Escape literal strings in the parser or (better) in the AlexToken case of lexToken. Needs to retain the whole unescaped string and might lead to escaping the same string repeatedly.
Add lexer logic that escapes the string. Even the alex user guide gives an example for string parsing using "start codes" (3.2.2.2 here, which seems to be inspired by flex's "start conditions"), e.g. you can specify and "call" a sub-lexer for lexing the content of a string literal. GHC's lexer already seems to use start codes for layout, so the awareness of the feature apparently wasn't enough to use it for lexing strings. Indeed, it doesn't solve the escaping problem at all! On the other hand, supporting string escapes seems to be an explicit goal for flex's start condition as per the user guide. So maybe it's as simple as modifying the lookahead token when we see a \ in a <string> start code?

Some feedback from @simonmar would be helpful, who might recall the exact pitfalls involved with (2).

Edited Apr 30, 2021 by Ben Gamari

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information