Skip to content

Make Read instances for Integral types faster, and make them fail fast

At present, the Read instances for standard types are generally written as though Read were a typical parser for a programming language. However, it most assuredly is not. Specifically, we currently follow the pattern of lexing and then parsing. The first problem with that is that while we follow that model, we don't actually get the big benefit of that model; the class methods don't mention or respect token boundaries, so lexing first doesn't prevent backtracking. More to the point, perhaps, lexing looks at the string and attempts to identify the next token, whatever it may be. But for Read, we don't need that. Thanks to the types, we have a very clear sense of what characters we expect to encounter.

I've started to sketch out an improvement, which reads Int and Word around seven times as fast (9.5 times as fast with parentheses and negation), and will fail immediately on something like read (fix ('a':)) :: Int rather than going into an infinite loop.

To begin with, modify the definition of paren thus:

expectP' :: Char -> ReadPrec ()
expectP' c = lift (expect' c)
{-# INLINE expectP' #-}

expect' :: Char -> ReadP ()
expect' c = do
  ReadP.skipSpaces
  thing <- ReadP.get
  if thing == c
    then pure ()
    else ReadP.pfail
{-# INLINE expect' #-}


paren :: ReadPrec a -> ReadPrec a
-- ^ @(paren p)@ parses \"(P0)\"
--      where @p@ parses \"P0\" in precedence context zero
paren p = do expectP' '('
             x <- reset p
             expectP' ')'
             return x

This allows fast failure when looking for parentheses, so we don't have to scan to the end of the first token (whatever it may be) before concluding that it is not '('.

Now we can parse Word and Int very efficiently. I had to specialize earlier than I wanted to convince GHC that I don't want to convert through Integer. I'm not sure why the fromIntegral rule doesn't fire reliably around here. The code below (temporarily) uses the current definition for base 16, because that's a bit fussy; I'll rewrite it soon.

charDiff :: Char -> Char -> Word
charDiff c1 c2 = fromIntegral (ord c1 - ord c2)
{-# INLINE charDiff #-}

readHexOct :: ReadP Word
readHexOct = do
  _ <- ReadP.char '0'
  baseId <- lexBaseChar
  case baseId of
    Oct -> readBaseP 8
    Hex -> L.readHexP
{-# INLINE readHexOct #-}

data BaseId = Oct | Hex

lexBaseChar :: ReadP BaseId
lexBaseChar = do { c <- ReadP.get;
                   case c of
                     'o' -> pure Oct
                     'O' -> pure Oct
                     'x' -> pure Hex
                     'X' -> pure Hex
                     _ -> ReadP.pfail }

readWord :: ReadP Word
readWord = readNumber (readHexOct ReadP.<++ readBaseP 10)

readInt :: ReadP Int
readInt = fromIntegral <$> readWord

readBaseP :: Integral a => Word -> ReadP a
readBaseP !base = do
  c <- ReadP.get
  let diff = charDiff c '0'
  if diff < base
    then readBaseP' base (fromIntegral diff)
    else ReadP.pfail
{-# INLINE readBaseP #-}

readBaseP' :: Integral a => Word -> a -> ReadP a
readBaseP' !base !acc0 = ReadP.look >>= go acc0
  where
    go !acc (c:cs) | diff < base = ReadP.get *> go (fromIntegral base * acc + fromIntegral diff) cs
      where diff = charDiff c '0'
    go !acc _ = pure acc
{-# INLINE readBaseP' #-}

readNumber :: Num a => ReadP.ReadP a -> ReadPrec a
readNumber p = parens $ do
  cs <- lift skipSpaces *> look
  case cs of
    ('-': _) -> get *> lift (skipSpaces *> (negate <$> p))
    _ -> lift $ skipSpaces *> p
{-# INLINE readNumber #-}

Other Word and Int-like types can be built on top of that foundation. I haven't yet attempted to deal with other instances, but I think there are probably a lot of opportunities for similar improvements.

WARNING: I know very little about parsing, and less about ReadPrec. It's conceivable that I've made some semantic errors here, although I don't think I have.

Edited by YitzGale
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information